diff --git a/s2_organisation_and_version_control/git/index.html b/s2_organisation_and_version_control/git/index.html index 9e4640be3..2ec36acf4 100644 --- a/s2_organisation_and_version_control/git/index.html +++ b/s2_organisation_and_version_control/git/index.html @@ -2033,7 +2033,7 @@
This will create a local copy of the repository which you have complete writing access to. Note that
-code updates to the original repository does not update code in your local repository.
+code updates to the original repository do not update code in your local repository.
Clone your local fork of the project using git clone.
Machine Learning Operations
Repository for course 02476 at DTU.
Checkout the homepage!
"},{"location":"#i-course-information","title":"\u2139\ufe0f Course information","text":"
Course responsible
Postdoc Nicki Skafte Detlefsen, nsde@dtu.dk
Professor S\u00f8ren Hauberg, sohau@dtu.dk
5 ECTS (European Credit Transfer System), corresponding to 140 hours of work
Recommended prerequisites: DTU course 02456 (Deep Learning) or experience with the following topics:
General understanding of machine learning (datasets, probability, classifiers, overfitting etc.)
Start by cloning or downloading this repository
git clone https://github.com/SkafteNicki/dtu_mlops\n
If you do not have git installed (yet) we will touch upon it in the course. The folder will contain all the exercise material and lectures for this course. Additionally, you should join our Slack channel, which we use for communication. If the link has expired, write to me.
"},{"location":"#course-organization","title":"\ud83d\udcc2 Course organization","text":"We highly recommend that when going through the material you use the homepage which is the corresponding Github pages version of this repository that is more nicely rendered, that also includes some special HTML magic provided by Material for MkDocs.
The course is divided into sessions, denoted by capital S, and modules, denoted by capital M. A session corresponds to a full day of work if you are following the course, meaning approximately 9 hours of work. Each session (S) corresponds to a topic within MLOps and consists of multiple modules (M) that each cover a tool within the session.
Importantly, we distinguish between core modules and optional modules. Core modules will be marked by
Core Module
at the top of their corresponding page. Core modules are important to go through to be able to pass the course. We still highly recommend that you also do the optional modules.
"},{"location":"#mlops-what-is-it","title":"\ud83c\udd92 MLOps: What is it?","text":"Machine Learning Operations (MLOps) is a rather new field that has seen its uprise as machine learning and particularly deep learning has become a widely available technology. The term itself is a compound of \"machine learning\" and \"operations\" and covers everything that has to do with the management of the production ML lifecycle.
The lifecycle of production ML can largely be divided into three phases:
Design: The initial phase starts with an investigation of the problem. Based on this analysis, several requirements can be prioritized for what we want our future model to do. Since machine learning requires data to be trained, we also investigate in this step what data we have and if we need to source it in some other way.
Model development: Based on the design phase, we can begin to conjure up some machine learning algorithms to solve our problem. As always, the initial step often involves doing some data analysis to make sure that our model is learning the signal that we want it to learn. Second comes the machine learning engineering phase, where the particular model architecture is chosen. Finally, we also need to do validation and testing to make sure that our model generalizes well.
Operations: Based on the model development phase, we now have a model that we want to use. Operations is where we create an automatic pipeline that makes sure that whenever we make changes to our codebase, they automatically get incorporated into our model, so that we do not slow down production. Equally important is the ongoing monitoring of already deployed models to make sure that they behave exactly as we specified.
It is important to note that the three steps are a cycle, meaning that when you have successfully deployed a machine learning model that is not the end of it. Your initial requirements may change, forcing you to revisit the design phase. Some new algorithms may show promising results, so you revisit the model development phase to implement this. Finally, you may try to cut the cost of running your model in production, making you revisit the operations phase, and trying to optimize some steps.
The focus in this course is particularly on the Operations part of MLOps as this is what many data scientists are missing in their toolbox to take all the knowledge they have about data processing and model development into a production setting.
"},{"location":"#learning-objectives","title":"\u2754 Learning objectives","text":"General course objective
Introduce the student to a number of coding practices that will help them organize, scale, monitor and deploy machine learning models either in a research or a production setting. To provide hands-on experience with a number of frameworks, both local and in the cloud, for working with large scale machine learning models.
This includes:
Additional reading resources (in no particular order):
Ref 1 Introduction blog post for those who have never heard about MLOps and want to get an overview.
Ref 2 Great document from Google about the different levels of MLOps.
Ref 3 Another introduction to the principles of MLOps and the different stages of MLOps.
Ref 4 Great paper about technical debt in machine learning.
Ref 5 Interview study that uncovers many of the pain points that ML engineers go through when doing MLOps.
Other courses with content similar to this:
Made with ML. Great online MLOps course that also covers additional topics on the foundations of working with ML.
Full stack deep learning. Another MLOps online course going through the whole developer pipeline.
MLOps Zoomcamp. MLOps online course that includes many of the same topics.
If you want to contribute to the course, we are happy to have you! Anything from fixing typos to adding new content is welcome. For building the course material locally, it is a simple two-step process:
pip install -r requirements.txt\nmkdocs serve\n
This will start a local server that you can access at localhost:8000
and will automatically update when you make changes to the course material. When you have something that you want to contribute, please make a pull request.
I highly value open source, and the content of this course is therefore free to use under the Apache 2.0 license. If you use parts of this course in your work, please cite using:
@misc{skafte_mlops,\n author = {Nicki Skafte Detlefsen},\n title = {Machine Learning Operations},\n howpublished = {\\url{https://github.com/SkafteNicki/dtu_mlops}},\n year = {2024}\n}\n
"},{"location":"challenges/","title":"Challenges","text":"If you have managed to go through all other material, congratulations, you are already a good way to becoming an MLOps engineer with a great overview of tools, concepts and techniques within the field. Below are listed some technical hard problems regarding MLOps. These are meant as inspiration to get you to deep dive more into using all the cloud services that gcp
offers. You are also free to continue work on your project.
Currently testing takes place in Github, but it should come as no surprise that gcp
can also take care of this. Try implementing testing on gcp
. This blogpost can probably help.
In the lectures we set up Cloud Build to automatically build a docker container for training whenever we pushed code to our github repository. However, we also set up CI testing in github. If tests are failing on github, the docker image is still being built, essentially wasting our precious cloud credits. Set up a system so that cloud building only commences when all tests are passing.
Authenticating between gcp
, wandb
and dvc
can be tricky to do in a secure way. Figure out how to use the Secret Manager in gcp
to pass secrets, e.g. API keys, during the build process of docker images. This page may help.
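For reference, here is a minimal sketch of reading a secret with the official Python client for Secret Manager. This is only an illustration of the API, not the documented build-process integration; the project and secret names are placeholders and it assumes pip install google-cloud-secret-manager.

from google.cloud import secretmanager

project_id = "my-gcp-project"   # placeholder project
secret_id = "WANDB_API_KEY"     # placeholder secret name

client = secretmanager.SecretManagerServiceClient()
name = f"projects/{project_id}/secrets/{secret_id}/versions/latest"

# Access the latest version of the secret and decode its payload
response = client.access_secret_version(request={"name": name})
api_key = response.payload.data.decode("UTF-8")
print("Fetched a secret of length", len(api_key))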
We have already done deployment through Cloud Functions
. The native extension to cloud functions is the service Cloud Run
which allows for more than just code snippets to be deployed. Check out this service and try to deploy a container using it.
All deployments we have done in the course have been serverless, because it makes it easier for us to focus on the actual application we are trying to deploy instead of focusing on server management. That said, going through the trouble of using a server orchestrator yourself can be worth it in many situations. Figure out how to use kubernetes in gcp
. It will involve getting familiar with the kubernetes API and probably also kubeflow for managing pipelines on the server.
Vertex AI is the newest ML service on gcp
. It combines many of the features of the AI platform service you have already used with the AutoML service. Figure out how to use Vertex AI service to either train a custom model or use their AutoML feature. This blogpost can be a good place to start.
If you want different services to be able to talk to each other, the correct way is to set up a system using the Pub and Sub (publish and subscribe) service in gcp
. Essentially it allows a service to publish a message and other services to subscribe and react to it. For example, the AI platform could publish a message every time a model was done training and cloud build could subscribe to that, automatically starting to build a docker image using the trained model. Investigate Pub and Sub and try to make two services talk to each other.
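To make the idea concrete, here is a minimal, hedged sketch of publishing a message with the official Python client (assumes pip install google-cloud-pubsub; the project and topic names are placeholders and the topic must already exist):

from google.cloud import pubsub_v1

project_id = "my-gcp-project"      # placeholder
topic_id = "model-training-done"   # placeholder

publisher = pubsub_v1.PublisherClient()
topic_path = publisher.topic_path(project_id, topic_id)

# Publish a small payload plus an attribute; subscribers can react to this event
future = publisher.publish(topic_path, b"training finished", model_name="my_model")
print("Published message with id:", future.result())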
In the deployment exercises you probably looked at the logs at least once. We can automate what we do with the logs using the Logs Explorer service, which collects all logs from all services that you are using. Set up log routing for one of your deployed services to your cloud storage. Afterwards, set up a VM that consumes the logs and accumulates them.
For further questions, please contact Nicki.
"},{"location":"faq/#is-it-possible-to-attend-the-course-fully-online","title":"Is it possible to attend the course fully online \u2754","text":"Mostly yes. All exercises are provided online and lectures will be recorded and streamed. However, do note that
Overall we try to support flexible learning as much as possible with some limitations.
"},{"location":"faq/#what-are-the-prerequisites-for-taking-this-course","title":"What are the prerequisites for taking this course \u2754","text":"We recommend that you have a basic understanding of machine learning concepts such as what a dataset is, what probabilities are, what a classifier is, what overfitting means etc. This corresponds to the curriculum covered in course 02450. The actual focus of the course is not on machine learning models, but we will be using these basic concepts throughout the exercises.
Additionally, we recommend basic knowledge about deep learning and how to code in Pytorch, corresponding to the curriculum covered in 02456. From prior experience, we know that not all students have gained knowledge about deep learning models before this course, and we will be covering the basics of how to code in PyTorch in one of the first modules of the course to get everyone up to speed.
"},{"location":"faq/#i-will-be-missing-x-days-of-the-course-will-that-be-a-problem","title":"I will be missing X days of the course, will that be a problem \u2754","text":"Depends. The course is fairly intensive, with most students working from 9-17 every day. If you already know that you will be missing X days of the course, then I highly recommend that you go through some of the first sessions before the course starts to give yourself a bit of breathing room. If you are not able to do so, please be aware that an additional effort may be needed from you to keep up with your fellow students.
"},{"location":"faq/#how-many-should-we-be-in-a-group-for-the-projects","title":"How many should we be in a group for the projects \u2754","text":"Between 3 and 5. The projects are designed to be done in groups meaning that we intentionally make them too big for one person to do alone. Luckily, a lot of the work that you need to do can be done in parallel, so it is not as bad as it sounds.
"},{"location":"faq/#when-will-the-exam-take-place","title":"When will the exam take place \u2754","text":"The oral part of the exam, which is a small project demo, always falls on the last day of the course. For January 2024, this means the 19th. The written part which is a small project report, should be handed in at midnight on the final course day.
"},{"location":"faq/#where-can-i-find-information-regarding-the-exam","title":"Where can I find information regarding the exam \u2754","text":"Look at the bottom of this page. Details will be updated as we get closer to the exam date.
"},{"location":"faq/#can-i-use-chatgpt-or-similar-tools-for-the-exercises-project-exam-report-coding-writing","title":"Can I use ChatGPT or similar tools for the exercises, project, exam report (coding + writing) \u2754","text":"Yes, yes, and yes, but remember that its a tool and you need to validate the output before using it. We would prefer for the exam report that you formulate the answers in your own words because it is intended for you do describe what you have been doing in your project. The I in LLM stands for intelligence.
"},{"location":"faq/#i-am-a-foreign-student-and-my-home-university-doesnt-accept-passnot-pass-what-can-i-do","title":"I am a foreign student and my home university doesn't accept pass/not pass, what can I do \u2754","text":"We can give a grade on the Danish 7-point grading scale for foreign students who need it, where their home university does not accept pass/no-pass. You need to contact the course responsible Nicki within the first week of the course to request this. Secondly, make sure to also inform us about it during the oral part of the exam because we need to ask you additional questions to be able to give an exact grade.
"},{"location":"faq/#i-am-a-euroteq-student-any-special-rules-for-me","title":"I am a EuroTEQ student, any special rules for me \u2754","text":"You will be allowed to attend the oral part of the exam online and we will provide a special Slack channel for you, trying to make sure that you get the same help as students from DTU who can attend the course on campus.
"},{"location":"overview/","title":"Summary of course content","text":"There are a lot of moving parts in this course, so it may be hard to understand how it all fits together. This page provides a summary of the frameworks in this course e.g. the stack of tools used. In the figure below we have provided an overview on how the different tools of the course interacts with each other. The table after the figure provides a short description of each of the parts.
The MLOps stack in the course. This is just an example of one stack; depending on your use case you may want to use a different stack of tools that better fits your needs. Regardless of the stack, the principles of MLOps are the same.
Framework Description
Pytorch is the backbone of our code. It provides the computational engine and the data structures that we need.
Pytorch lightning is a framework that provides a high-level interface to Pytorch. It provides a lot of the functionality that we need to train our models, such as logging, checkpointing, early stopping, etc., so that we do not have to implement it ourselves. It also allows us to scale our models to multiple GPUs and multiple nodes.
We control the dependencies and Python interpreter using Conda, which enables us to construct reproducible virtual environments.
For configuring our experiments we use Hydra, which allows us to define a hierarchical configuration structure through config files.
Using Weights and Bias allows us to track and log any values and hyperparameters for our experiments.
Whenever we run into performance bottlenecks with our code we can use the Profiler to find the cause of the bottleneck.
When we run into bugs in our code we can use the Debugger to find the cause of the bug.
For organizing our code and creating templates we can use Cookiecutter.
Docker is a tool that allows us to create a container that contains all the dependencies and code that we need to run our code.
For controlling the versions of our data and synchronizing between local and remote data storage, we can use DVC, which makes this process easy.
For version control of our code we use Git (in combination with Github), which allows multiple developers to work together on a shared codebase.
We can use Pytest to write unit tests for our code, to make sure that new changes to the code do not break the code base.
For linting our code and keeping a consistent coding style we can use tools such as Pylint and Flake8, which check our code for common mistakes and style issues.
For running our unit tests and other checks on our code in a continuous manner, e.g. after we commit and push our code, we can use Github actions, which automate this process.
Using Cloud build we can automate the process of building our docker images and pushing them to our container registry.
Container registry is a service that allows us to store our docker images for later use by other services.
For storing our data and trained models we can use Cloud storage, which provides a scalable and secure storage solution.
For general compute tasks we can use Compute engine, which provides a scalable and secure compute solution.
For training our experiments in an easy and scalable manner we can use Vertex AI.
For creating a REST API for our model we can use FastAPI, which provides a high-level interface for creating APIs.
For simple deployments of our code we can use Cloud functions, which allows us to run our code in response to events through simple python functions.
For more complex deployments of our code we can use Cloud run, which allows us to run our code in response to events through docker containers.
Cloud monitoring gives us the tools to keep track of important logs and errors from the other cloud services.
For monitoring whether our deployed model is experiencing any drift we can use Evidently AI, which provides a framework and dashboard for monitoring drift.
For monitoring the telemetry of our deployed model we can use OpenTelemetry, which provides a standard for collecting and exporting telemetry data."},{"location":"projects/","title":"Project work","text":"Slides
Approximately 1/3 of the course time is dedicated to doing project work. The projects will serve as the basis of your exam. In the project, you will essentially re-apply everything that you learn throughout the course to a self-chosen project. The overall goals of the project are:
In the projects you are free to work on whatever problem you want. That said, we have one specific requirement: you need to incorporate some third-party framework into your project. If you want inspiration for projects, here are some examples:
Classification of tweets
Translating from English to German
Classification of scientific papers
Classification of rice types from images
We hope most students will be able to form groups by themselves. Expected group size is between 3 and 5. If you are not able to form a group, please make sure to post in the #looking-for-group
channel on Slack or make sure to be present on the 4th day of the course (the day before the project work starts) where we will help students that have not found a group yet.
We strive to keep the tools taught in this course as open-source as possible. The great thing about the open-source community is that whatever problem you are working on, there is probably some package out there that can get you at least 10% of the way. For the project, we want to enforce this point and you are required to include some third-party package, that is neither Pytorch nor one of the tools already covered in the course, into your project.
If you have no idea what framework to include, the Pytorch ecosystem is a great place for finding open-source frameworks that can help you accelerate your own projects where Pytorch is the backbone. All tools in the ecosystem should work well together with Pytorch. However, it is important to note that the ecosystem is not a complete list of all the awesome packages that exist to extend the functionality of Pytorch. If you are still missing inspiration for frameworks to use, we highly recommend these three that have been used in previous years of the course (a small usage sketch is given after the list):
PyTorch Image Models. PyTorch Image Models (also known as TIMM) is by far the most used computer vision package (maybe except for torchvision
). It contains models, scripts and pre-trained weights for a lot of state-of-the-art image models within computer vision.
Transformers. The Transformers repository from the Huggingface group focuses on state-of-the-art Natural Language Processing (NLP). It provides many pre-trained models to perform tasks on text such as classification, information extraction, question answering, summarization, translation, text generation, etc. in 100+ languages. Its aim is to make cutting-edge NLP easier to use for everyone.
Pytorch-Geometric. PyTorch Geometric (PyG) is a geometric deep learning library. It consists of various methods for deep learning on graphs and other irregular structures, also known as geometric deep learning, from a variety of published papers.
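As a hedged illustration of how little code it takes to get started with two of these frameworks (the model and pipeline names below are just examples, and it assumes pip install timm transformers):

import timm
from transformers import pipeline

# TIMM: create a pre-trained image classifier by name
vision_model = timm.create_model("resnet18", pretrained=True)
vision_model.eval()

# Transformers: a ready-made sentiment-analysis pipeline with a default pre-trained model
classifier = pipeline("sentiment-analysis")
print(classifier("Incorporating a third-party framework saved us a lot of time!"))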
Each project day is fully dedicated to project work, except for maybe external inspirational lectures in the morning. The group decides exactly where they want to work on the project, how they want to work on the project, how to distribute the workload etc. We strongly encourage you to parallelize work during the project, because there are a lot of tasks to do, but it is important that all group members have at least some understanding of the whole project.
Remember that the focus of the project work is not to demonstrate that you can work with the biggest and baddest deep learning model, but instead that you show that you can incorporate the tools that are taught throughout the course in a meaningful way.
Also note that the project is not expected to be very large in scope. It may simply be that you want to train X model on Y data. You will be given approximately 4 full days to work on the project. It is better to start out with a smaller project and then add complexity along the way if you have time.
"},{"location":"projects/#day-1","title":"Day 1","text":"The first project days is all about getting started on the projects and formulating exactly what you want to work on as a group.
Start by brainstorming projects! Try to figure out exactly what you want to work with and begin to investigate what third-party package can support the project.
When you have come up with an idea, write a project description. The description is the delivery for today and should be at least 300 words. Try to answer the following questions in the description:
(Optional) If you want to think more about the product design of your project, feel free to fill out the ML canvas (or part of it). You can read more about the different fields on canvas here.
After having done the product description, you can start on the actual coding of the project. In the next section, a to-do list is attached that summarizes what we are doing in the course. You are NOT expected to fulfill all bullet points from week 1 today.
The project description will serve as a guideline for us at the exam to check that you have somewhat reached the goals that you set out to do. By the end of the day, you should commit your project description to the README.md
file belonging to your project repository. If you filled out the ML canvas, feel free to include that as part of the README.md
file. Also remember to commit whatever you have done on the project until now. When you have done this, go to DTU Learn and hand-in (as a group) the link to your github repository as an assignment.
We will briefly (before next Monday) look over your github repository and project description to check that everything is fine. If we have any questions/concerns we will contact you.
"},{"location":"projects/#day-2","title":"Day 2","text":"The goal for today is simply to continue working on your project. Start with bullet points in the checklist from week 1 and continue with bullet points for week 2.
"},{"location":"projects/#day-3","title":"Day 3","text":"Continue working on your project, today you should hopefully focus on the bullet points in the checklist from week 2. There is no delivery for this week, but make sure that you have committed all your progress at the end of the day. We will again briefly look over the repositories and will reach out to your group if we are worried about the progression of your project.
"},{"location":"projects/#day-4","title":"Day 4","text":"We have now entered the final week of the course and the second last project day. You are most likely continuing with bullet points from week 2, but should hopefully begin to look at the bullet points from week 3 today. These are in general much more complex, so we recommend looking at them until you have completed most from week 2. We also recommend that you being to fill our report template.
"},{"location":"projects/#day-5","title":"Day 5","text":"Today you are finishing your project. We recommend that you start by creating a architechtual overview of your project similar to this figure. I recommend using draw.io for creating this kind of diagram, but feel free to use any tool you like. Else you should just continue working on your project, checking of as many bullet points as possible. Finally, you should also prepare yourself for the exam tomorrow.
"},{"location":"projects/#project-checklist","title":"Project checklist","text":"Please note that all the lists are exhaustive meaning that I do not expect you to have completed very point on the checklist for the exam.
"},{"location":"projects/#week-1","title":"Week 1","text":"make_dataset.py
file such that it downloads whatever data you need and
requirements.txt
file with whatever dependencies that you are using
pep8
) while doing the project
The exam consists of a written and an oral element, and both contribute to the overall evaluation of whether you pass or do not pass the course.
For the written part of the exam we provide a template folder called reports. As the first task you should copy the folder and all its content to your project repository. Then, your job is to fill out the README.md
file which contains the report template. The file itself contains instructions on how to fill it out and instructions on using the included report.py
file. You will hand in the template by simply including it in your project repository. By midnight on 20/1 we will scrape it automatically, and changes after this point will therefore not be registered.
For the oral part of the exam you will be given a time slot where you have to show up for 5-7 min and give a very short demo of your project. What we are interested in seeing is essentially a live demo of your deployed application/project. We will possibly also ask questions regarding the overall curriculum of the course. Importantly, you should have your deployed application, the github repository with your project code, your W&B account and your GCP account ready before you enter the exam so we can quickly jump around. We will send out the time slots during the last week.
"},{"location":"timeplan/","title":"Timeplan","text":"Slides
The course is organised into exercise days (2/3 of the course) and project days (1/3 of the course).
Exercise days start at 9:00 in the morning with a lecture (15-30 min) that will give some context about at least one of the topics of that day. Additionally, the previous day's exercises may be briefly touched upon. The remainder of the day will be spent on solving exercises either individually or in small groups. For some people the exercises may be fast to do and for others they will take the whole day. We will provide help throughout the day. We will try to answer questions on slack, but help will be prioritized for students physically on campus.
Project days are intended for project work and you are therefore responsible for making an agreement with your group on when and where you are going to work. On the first project day there will be a lecture at 9:00 with project information. On other project days we may also start the day with an external lecture, which we highly recommend that you participate in. During each project day we will have office hours where you can ask questions about the project.
Below is an overall timeplan for each day, including the presentation topic of the day and the frameworks that you will be using in the exercises.
Legend: \ud83d\udcdd Slides, \ud83c\udfa5 Recording.
Note
Current dates listed below are for January 2024 version of the course. The lectures and recordings are currently from January 2023 version of the course. Please note that for January 2024, the first week starts on a Tuesday and ends on a Saturday.
"},{"location":"timeplan/#week-1","title":"Week 1","text":"In the first week you will be introduced to a number of development practices for organizing and developing code, especially with a focus on making everything reproducible.
Date Day Presentation topic Frameworks Format
2/1 Tuesday Deep learning software \ud83d\udcdd \ud83c\udfa5(2023) \ud83c\udfa5(2024) Terminal, Conda, IDE, Pytorch Exercises
3/1 Wednesday MLOps: what is it? \ud83d\udcdd \ud83c\udfa5(2023) \ud83c\udfa5(2023) Git, CookieCutter, Pep8, DVC Exercises
4/1 Thursday Reproducibility \ud83d\udcdd \ud83c\udfa5(2023) \ud83c\udfa5(2024) Docker, Hydra Exercises
5/1 Friday Debugging \ud83d\udcdd \ud83c\udfa5(2023) \ud83c\udfa5(2024) Debugger, Profiler, Wandb, Lightning Exercises
6/1 Saturday Pytorch ecosystem \ud83d\udcdd \ud83c\udfa5(2023) \ud83c\udfa5(2024) - Projects"},{"location":"timeplan/#week-2","title":"Week 2","text":"The second week is about automation and the cloud. Automation will help us make sure that our code does not break when we make changes to it. The cloud will help us scale up our applications, and we will learn how to use different services to help develop a full machine learning pipeline.
Date Day Presentation topic Frameworks Format
8/1 Monday Continuous Integration \ud83d\udcdd \ud83c\udfa5 Pytest, Github actions, Pre-commit, CML Exercises
9/1 Tuesday The Cloud \ud83d\udcdd \ud83c\udfa5 GCP Engine, Bucket, Container registry, Vertex AI Exercises
10/1 Wednesday Deployment \ud83d\udcdd \ud83c\udfa5 FastAPI, Torchserve, GCP Functions, Run Exercises
11/1 Thursday No lecture \ud83c\udfa5 - Projects
12/1 Friday No lecture \ud83c\udfa5 - Projects"},{"location":"timeplan/#week-3","title":"Week 3","text":"In the final week we look into advanced topics such as monitoring and scaling of applications. Monitoring is especially important for the longevity of the applications that we develop: we need to actually be able to deploy them either locally or in the cloud, and we need the tools to monitor how they behave over time. Scaling of applications is an important topic if we ever want our applications to be used by many people at the same time.
Date Day Presentation topic Frameworks Format
15/1 Monday Monitoring \ud83d\udcdd \ud83c\udfa5 Evidently AI, OpenTelemetry, Signoz Exercises
16/1 Tuesday Scalable applications \ud83d\udcdd \ud83c\udfa5 Pytorch, Lightning Exercises
17/1 Wednesday - - Projects
18/1 Thursday - - Projects
19/1 Friday - - Exam"},{"location":"reports/","title":"Exam template for 02476 Machine Learning Operations","text":"This is the report template for the exam. Please only remove the text formatted with three dashes in front and behind, like:
--- question 1 fill here ---
where you instead should add your answers. Any other changes may have unwanted consequences when your report is auto-generated at the end of the course. For questions where you are asked to include images, start by adding the image to the figures
subfolder (please only use .png
, .jpg
or .jpeg
) and then add the following code in your answer:
![my_image](figures/<image>.<extension>)\n
In addition to this markdown file, we also provide the report.py
script that provides two utility functions:
Running:
python report.py html\n
will generate an .html
page of your report. After the deadline for answering this template, we will autoscrape everything in this reports
folder and then use this utility to generate an .html
page that will serve as your final hand-in.
Running
python report.py check\n
will check your answers in this template against the constraints listed for each question, e.g. is your answer too short, too long, or have you included an image when asked to.
For both functions to work it is important that you do not rename anything. The script has two dependencies that can be installed with pip install click markdown
.
The checklist is exhaustive, which means that it includes everything that you could possibly do on the project in relation to the curriculum in this course. Therefore, we do not at all expect that you have checked off all boxes at the end of the project.
"},{"location":"reports/#week-1","title":"Week 1","text":"make_dataset.py
file such that it downloads whatever data you need and
requirements.txt
file with whatever dependencies that you are using
pep8
) while doing the project
Enter the group number you signed up on
Answer:
--- question 1 fill here ---
"},{"location":"reports/#question-2","title":"Question 2","text":"Enter the study number for each member in the group
Example:
sXXXXXX, sXXXXXX, sXXXXXX
Answer:
--- question 2 fill here ---
"},{"location":"reports/#question-3","title":"Question 3","text":"What framework did you choose to work with and did it help you complete the project?
Answer length: 100-200 words.
Example: We used the third-party framework ... in our project. We used functionality ... and functionality ... from the package to do ... and ... in our project.
Answer:
--- question 3 fill here ---
"},{"location":"reports/#coding-environment","title":"Coding environment","text":"In the following section we are interested in learning more about you local development environment.
"},{"location":"reports/#question-4","title":"Question 4","text":"Explain how you managed dependencies in your project? Explain the process a new team member would have to go through to get an exact copy of your environment.
Answer length: 100-200 words
Example: We used ... for managing our dependencies. The list of dependencies was auto-generated using ... . To get a complete copy of our development environment, one would have to run the following commands
Answer:
--- question 4 fill here ---
"},{"location":"reports/#question-5","title":"Question 5","text":"We expect that you initialized your project using the cookiecutter template. Explain the overall structure of your code. Did you fill out every folder or only a subset?
Answer length: 100-200 words
Example: From the cookiecutter template we have filled out the ... , ... and ... folder. We have removed the ... folder because we did not use any ... in our project. We have added an ... folder that contains ... for running our experiments. Answer:
--- question 5 fill here ---
"},{"location":"reports/#question-6","title":"Question 6","text":"Did you implement any rules for code quality and format? Additionally, explain with your own words why these concepts matters in larger projects.
Answer length: 50-100 words.
Answer:
--- question 6 fill here ---
"},{"location":"reports/#version-control","title":"Version control","text":"In the following section we are interested in how version control was used in your project during development to corporate and increase the quality of your code.
"},{"location":"reports/#question-7","title":"Question 7","text":"How many tests did you implement and what are they testing in your code?
Answer length: 50-100 words.
Example: In total we have implemented X tests. Primarily we are testing ... and ... as these are the most critical parts of our application, but also ... .
Answer:
--- question 7 fill here ---
"},{"location":"reports/#question-8","title":"Question 8","text":"What is the total code coverage (in percentage) of your code? If you code had an code coverage of 100% (or close to), would you still trust it to be error free? Explain you reasoning.
Answer length: 100-200 words.
Example: The total code coverage of our code is X%, which includes all our source code. We are far from 100% coverage of our code and even if we were, then...
Answer:
--- question 8 fill here ---
"},{"location":"reports/#question-9","title":"Question 9","text":"Did you workflow include using branches and pull requests? If yes, explain how. If not, explain how branches and pull request can help improve version control.
Answer length: 100-200 words.
Example: We made use of both branches and PRs in our project. In our group, each member had a branch that they worked on in addition to the main branch. To merge code we ...
Answer:
--- question 9 fill here ---
"},{"location":"reports/#question-10","title":"Question 10","text":"Did you use DVC for managing data in your project? If yes, then how did it improve your project to have version control of your data. If no, explain a case where it would be beneficial to have version control of your data.
Answer length: 100-200 words.
Example: We did make use of DVC in the following way: ... . In the end it helped us in ... for controlling ... part of our pipeline
Answer:
--- question 10 fill here ---
"},{"location":"reports/#question-11","title":"Question 11","text":"Discuss you continues integration setup. What kind of CI are you running (unittesting, linting, etc.)? Do you test multiple operating systems, python version etc. Do you make use of caching? Feel free to insert a link to one of your github actions workflow.
Answer length: 200-300 words.
Example: We have organized our CI into 3 separate files: one for doing ..., one for running ... testing and one for running ... . In particular for our ..., we used ... .An example of a triggered workflow can be seen here:
Answer:
--- question 11 fill here ---
"},{"location":"reports/#running-code-and-tracking-experiments","title":"Running code and tracking experiments","text":"In the following section we are interested in learning more about the experimental setup for running your code and especially the reproducibility of your experiments.
"},{"location":"reports/#question-12","title":"Question 12","text":"How did you configure experiments? Did you make use of config files? Explain with coding examples of how you would run a experiment.
Answer length: 50-100 words.
Example: We used a simple argparser, that worked in the following way: python my_script.py --lr 1e-3 --batch_size 25
Answer:
--- question 12 fill here ---
"},{"location":"reports/#question-13","title":"Question 13","text":"Reproducibility of experiments are important. Related to the last question, how did you secure that no information is lost when running experiments and that your experiments are reproducible?
Answer length: 100-200 words.
Example: We made use of config files. Whenever an experiment is run the following happens: ... . To reproduce an experiment one would have to do ...
Answer:
--- question 13 fill here ---
"},{"location":"reports/#question-14","title":"Question 14","text":"Upload 1 to 3 screenshots that show the experiments that you have done in W&B (or another experiment tracking service of your choice). This may include loss graphs, logged images, hyperparameter sweeps etc. You can take inspiration from this figure. Explain what metrics you are tracking and why they are important.
Answer length: 200-300 words + 1 to 3 screenshots.
Example: As seen in the first image we have tracked ... and ... which both inform us about ... in our experiments. As seen in the second image we are also tracking ... and ...
Answer:
--- question 14 fill here ---
"},{"location":"reports/#question-15","title":"Question 15","text":"Docker is an important tool for creating containerized applications. Explain how you used docker in your experiments? Include how you would run your docker images and include a link to one of your docker files.
Answer length: 100-200 words.
Example: For our project we developed several images: one for training, inference and deployment. For example to run the training docker image: docker run trainer:latest lr=1e-3 batch_size=64
. Link to docker file:
Answer:
--- question 15 fill here ---
"},{"location":"reports/#question-16","title":"Question 16","text":"When running into bugs while trying to run your experiments, how did you perform debugging? Additionally, did you try to profile your code or do you think it is already perfect?
Answer length: 100-200 words.
Example: Debugging method was dependent on group member. Some just used ... and others used ... . We did a single profiling run of our main code at some point that showed ...
Answer:
--- question 16 fill here ---
"},{"location":"reports/#working-in-the-cloud","title":"Working in the cloud","text":"In the following section we would like to know more about your experience when developing in the cloud.
"},{"location":"reports/#question-17","title":"Question 17","text":"List all the GCP services that you made use of in your project and shortly explain what each service does?
Answer length: 50-200 words.
Example: We used the following two services: Engine and Bucket. Engine is used for... and Bucket is used for...
Answer:
--- question 17 fill here ---
"},{"location":"reports/#question-18","title":"Question 18","text":"The backbone of GCP is the Compute engine. Explained how you made use of this service and what type of VMs you used?
Answer length: 100-200 words.
Example: We used the compute engine to run our ... . We used instances with the following hardware: ... and we started them using a custom container: ...
Answer:
--- question 18 fill here ---
"},{"location":"reports/#question-19","title":"Question 19","text":"Insert 1-2 images of your GCP bucket, such that we can see what data you have stored in it. You can take inspiration from this figure.
Answer:
--- question 19 fill here ---
"},{"location":"reports/#question-20","title":"Question 20","text":"Upload one image of your GCP container registry, such that we can see the different images that you have stored. You can take inspiration from this figure.
Answer:
--- question 20 fill here ---
"},{"location":"reports/#question-21","title":"Question 21","text":"Upload one image of your GCP cloud build history, so we can see the history of the images that have been build in your project. You can take inspiration from this figure.
Answer:
--- question 21 fill here ---
"},{"location":"reports/#question-22","title":"Question 22","text":"Did you manage to deploy your model, either in locally or cloud? If not, describe why. If yes, describe how and preferably how you invoke your deployed service?
Answer length: 100-200 words.
Example: For deployment we wrapped our model into an application using ... . We first tried locally serving the model, which worked. Afterwards we deployed it in the cloud, using ... . To invoke the service a user would call curl -X POST -F \"file=@file.json\"<weburl>
Answer:
--- question 22 fill here ---
"},{"location":"reports/#question-23","title":"Question 23","text":"Did you manage to implement monitoring of your deployed model? If yes, explain how it works. If not, explain how monitoring would help the longevity of your application.
Answer length: 100-200 words.
Example: We did not manage to implement monitoring. We would like to have monitoring implemented such that over time we could measure ... and ... that would inform us about this ... behaviour of our application.
Answer:
--- question 23 fill here ---
"},{"location":"reports/#question-24","title":"Question 24","text":"How many credits did you end up using during the project and what service was most expensive?
Answer length: 25-100 words.
Example: Group member 1 used ..., Group member 2 used ..., in total ... credits were spent during development. The service costing the most was ... due to ...
Answer:
--- question 24 fill here ---
"},{"location":"reports/#overall-discussion-of-project","title":"Overall discussion of project","text":"In the following section we would like you to think about the general structure of your project.
"},{"location":"reports/#question-25","title":"Question 25","text":"Include a figure that describes the overall architecture of your system and what services that you make use of. You can take inspiration from this figure. Additionally in your own words, explain the overall steps in figure.
Answer length: 200-400 words
Example:
The starting point of the diagram is our local setup, where we integrated ... and ... and ... into our code. Whenever we commit code and push to github, it auto-triggers ... and ... . From there the diagram shows ...
Answer:
--- question 25 fill here ---
"},{"location":"reports/#question-26","title":"Question 26","text":"Discuss the overall struggles of the project. Where did you spend most time and what did you do to overcome these challenges?
Answer length: 200-400 words.
Example: The biggest challenge in the project was using the ... tool to do ... . The reason for this was ...
Answer:
--- question 26 fill here ---
"},{"location":"reports/#question-27","title":"Question 27","text":"State the individual contributions of each team member. This is required information from DTU, because we need to make sure all members contributed actively to the project
Answer length: 50-200 words.
Example: Student sXXXXXX was in charge of setting up the initial cookie cutter project and developing the docker containers for training our applications. Student sXXXXXX was in charge of training our models in the cloud and deploying them afterwards. All members contributed to code by...
Answer:
--- question 27 fill here ---
"},{"location":"s10_extra/","title":"Extra learning modules","text":"All modules listed here are not part of the core course, but expands on some of the other topics. Some of them may still be under construction and may in the future be moved into other sessions.
"},{"location":"s10_extra/cli/","title":"M30 - Command Line Interfaces","text":""},{"location":"s10_extra/cli/#command-line-interfaces","title":"Command line interfaces","text":"
If you have worked with python for some time you are probably familiar with the argparse
package, which allows you to directly pass in additional arguments to your script in the terminal
python my_script.py --arg1 val1 --arg2 val2\n
argparse
is a very simple way of constructing what is called a command line interface (CLI). A CLI allows you to interact with your application directly in the terminal instead of having to change things in your code. It is essentially a text-based user interface (UI), in contrast to a graphical user interface (GUI) that we know from all our desktop applications.
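As a minimal sketch (not part of the original exercise files), a my_script.py that accepts the call shown above could look like this:

import argparse

# Define a parser with the two options used in the command above
parser = argparse.ArgumentParser(description="Minimal CLI example")
parser.add_argument("--arg1", type=str, required=True, help="first argument")
parser.add_argument("--arg2", type=str, required=True, help="second argument")

args = parser.parse_args()
print(f"arg1={args.arg1}, arg2={args.arg2}")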
However, one limitation of argparse
is that it does not make it easy to define a CLI with subcommands. If we take git
as an example, git
is the main command but it has multiple subcommands: push
, pull
, commit
etc. that all can take their own arguments. This kind of second CLI with subcommands is somewhat possible to do using only argparse
, however it requires a bit of hacking.
You could of course ask why we would want to define such a CLI at all. The main argument is to give users of our code a single entrypoint to interact with our application instead of having multiple scripts. As long as all subcommands are properly documented, our interface should be simple to interact with (again think of git
where each subcommand can be given the -h
arg to get specific help).
Instead of using argparse
we are here going to look at the click package. click
extends the functionalities of argparse
to allow for easy definition of subcommands and many other things, which we are not going to touch upon in this module. For completeness we should also mention that click
is not the only package for doing this; among other excellent frameworks for easily creating command line interfaces we can mention Typer.
Exercise files
Install click
pip install click\n
Create a new python file greetings.py
and add the following code:
import click\n\n@click.command()\n@click.option('--count', default=1, help='Number of greetings.')\n@click.option('--name', prompt='Your name', help='The person to greet.')\ndef hello(count, name):\n \"\"\"Simple program that greets NAME for a total of COUNT times.\"\"\"\n for x in range(count):\n click.echo(f\"Hello {name}!\")\n\nif __name__ == '__main__':\n hello()\n
try running the program in the following ways
python greetings.py\npython greetings.py --count=3\npython greetings.py --help\n
Make sure you understand what the click.command()
decorator and click.option
decorator do. You can find the full API docs here.
As stated above, the power of using a tool like click is due to its ability to define subcommands. In click
this is done through the click.group()
decorator. To the code example from above, add another command:
@click.command()\n@click.option('--count', default=1, help='Number of greetings.')\n@click.option('--name', prompt='Your name', help='The person to greet.')\ndef howdy(count, name):\n for x in range(count):\n click.echo(f\"Howdy {name}!\")\n
and by using the click.group()
decorator make these commands into subcommands such that you would be able to call the script in the following way
python greetings.py hello\npython greetings.py howdy\n
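If you get stuck, here is one possible sketch of how the grouping could look (a hint, not the only solution): define a group function decorated with click.group() and attach the two commands to it with @cli.command() instead of @click.command():

import click

@click.group()
def cli():
    # the group itself does nothing; it only dispatches to its subcommands
    pass

@cli.command()
@click.option('--count', default=1, help='Number of greetings.')
@click.option('--name', prompt='Your name', help='The person to greet.')
def hello(count, name):
    for _ in range(count):
        click.echo(f"Hello {name}!")

@cli.command()
@click.option('--count', default=1, help='Number of greetings.')
@click.option('--name', prompt='Your name', help='The person to greet.')
def howdy(count, name):
    for _ in range(count):
        click.echo(f"Howdy {name}!")

if __name__ == '__main__':
    cli()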
As a final exercise we provide you with a script that is ready to run as it is, but your job will be to turn it into a script with multiple subcommands, with multiple arguments for each subcommand.
Start by taking a look at the provided code. It is a simple script that runs the K-nearest neighbour classification algorithm on the iris dataset and produces a plot of the decision boundary.
Create a script that has the following subcommands with input arguments
train
: Load data, train model and save. Should take a single argument -o
that specifies the filename the trained model should be saved to.
infer
: Load trained model and runs prediction on input data. Should take two arguments: -i
that specifies which trained model to load and -d
to specify a user defined datapoint to run inference on.
plot
: Load trained model and constructs the decision boundary plot from the code. Should take two arguments: -i
that specifies a trained model to load and -o
the file to write the generated plot to
optim
: Load data, runs hyperparameter optimization and prints the optimal parameters. Should at least take a single argument that in some way adjusts the hyperparameter optimization (free to choose how).
In the end we would like the script to be callable in the following ways (a possible skeleton is sketched after the commands):
python main.py train -o 'model.ckpt'\npython main.py infer -i 'model.ckpt' -d [[0,1]]\npython main.py plot -i 'model.ckpt' -o 'generated_plot.png'\npython main.py optim\n
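A possible skeleton for such a script is sketched below. The function bodies are placeholders, not the provided solution; only the option names follow the exercise description:

import click

@click.group()
def cli():
    # top-level entry point; each subcommand is attached below
    pass

@cli.command()
@click.option('-o', '--output', default='model.ckpt', help='File to save the trained model to.')
def train(output):
    click.echo(f"Training model and saving it to {output}")
    # ... load the iris data, fit the KNN classifier and save it to `output`

@cli.command()
@click.option('-i', '--input', 'model_file', default='model.ckpt', help='Trained model to load.')
@click.option('-d', '--datapoint', help='Datapoint to run inference on, e.g. "[[0,1]]".')
def infer(model_file, datapoint):
    click.echo(f"Running inference on {datapoint} with {model_file}")

@cli.command()
@click.option('-i', '--input', 'model_file', default='model.ckpt', help='Trained model to load.')
@click.option('-o', '--output', default='generated_plot.png', help='File to write the plot to.')
def plot(model_file, output):
    click.echo(f"Plotting decision boundary of {model_file} to {output}")

@cli.command()
@click.option('--n-trials', default=10, help='Number of hyperparameter settings to try.')
def optim(n_trials):
    click.echo(f"Running hyperparameter optimization with {n_trials} trials")

if __name__ == '__main__':
    cli()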
Danger
Module is still under development
\"Machine learning engineering is 10% machine learning and 90% engineering.\" - Chip Huyen
We highly recommend that you read the book Designing Machine Learning Systems: An Iterative Process for Production-Ready Applications by Chip Huyen, which gives a fantastic overview of the thought processes that go into designing modern machine learning systems.
"},{"location":"s10_extra/design/#the-stack","title":"The stack","text":"Have you ever encountered the concept of full stack developer. A full stack developer is an developer who can both develop client and server software or in more general terms, it is a developer who can take care of the complete developer pipeline.
Below is an image of the massive number of tools that exist under the MLOps umbrella.
"},{"location":"s10_extra/design/#visualizing-the-design","title":"Visualizing the design","text":""},{"location":"s10_extra/documentation/","title":"M31 - Documentation","text":""},{"location":"s10_extra/documentation/#documentation","title":"Documentation","text":"In today's rapidly evolving software development landscape, effective documentation is a crucial component of any project. The ability to create clear, concise, and user-friendly technical documentation can make a significant difference in the success of your codebase. We all probably encountered code that we wanted to use, only for us to abandon using it because it was missing documentation such that we could get started with it.
Technical documentation or code documentation can be many things:
and many more. We are in this module going to focus on setting up a very basic documentation system that will automatically help you document the API of your code. For this reason we recommend that before continuing with this module you have completed module M7 on good coding practices or have similar experience with writing docstrings for python functions and classes.
There are different systems for writing documentation. In fact there is a lot to choose from:
Important to note is that all of these are static site generators. The word static here refers to the fact that when the content is generated and served on a website, the underlying HTML code will not change. It may contain HTML elements that are dynamic (like video), but the site itself does not change (1).
We are in this module going to look at Mkdocs, which (in my opinion) is one of the easiest systems to get started with because all documentation is written in markdown and the build system is written in Python. As an alternative, you can consider doing the exercises in Sphinx, which is probably the most used documentation system for Python code. Sphinx offers more customization than Mkdocs, so it is generally preferred for larger projects with complex documentation, but for smaller projects Mkdocs should be easier to get started with and is sufficient.
Mkdocs by default does not include many features and for that reason we are going to dive directly into using the material for mkdocs theme, which provides a lot of nice customization to create professional static sites. In fact, this whole course is written in mkdocs using the material theme.
"},{"location":"s10_extra/documentation/#mkdocs","title":"Mkdocs","text":"The core file when using mkdocs is the mkdocs.yml
file, which is the configuration file for the project:
site_name: Documentation of my project\nsite_author: Jane Doe\ndocs_dir: source # (1)!\n\ntheme:\n language: en\n name: material # (2)!\n features: # (3)!\n - content.code.copy\n - content.code.annotate\n\nplugins: # (4)!\n - search\n - mkdocstrings\n\nnav: # (5)!\n - Home: index.md\n
This indicates the source directory of our documentation. If the layout of your documentation is a bit different than what described above, you may need to change this.
The overall theme of your documentation. We recommend the material
theme but there are many more to choose from and you can also create your own.
The features
section is where features that are supported by your given theme can be enabled. In this example we have enabled content.code.copy
feature which adds a small copy button to all code blocks and the content.code.annotate
feature which allows you to add annotations like this box to code blocks.
Plugins add new functionality to your documentation. In this case we have added two plugins that add functionality for searching through our documentation and automatically adding documentation from docstrings. Remember that some plugins require you to install additional Python packages, so remember to add them to your requirements.txt
file.
The nav
section is where you define the navigation structure of your documentation. When you add new .md
files to the source
folder you then need to add them to the nav
section.
And that is more or less what you need to get started. In general, if you need help with configuration of your documentation in mkdocs I recommend looking at this page and this page.
"},{"location":"s10_extra/documentation/#exercises","title":"Exercises","text":"In this set of exercises we assume that you have completed module M6 on code structure and therefore have a repository that at least contains the following:
\u251c\u2500\u2500 pyproject.toml <- Project configuration file with package metadata\n\u2502\n\u251c\u2500\u2500 docs <- Documentation folder\n\u2502 \u2502\n\u2502 \u251c\u2500\u2500 index.md <- Homepage for your documentation\n\u2502 \u2502\n\u2502 \u251c\u2500\u2500 mkdocs.yml <- Configuration file for mkdocs\n\u2502 \u2502\n\u2502 \u2514\u2500\u2500 source/ <- Source directory for documentation files\n\u2502\n\u2514\u2500\u2500 src <- Source code for use in this project.\n\u2502 \u2502\n\u2502 \u251c\u2500\u2500 __init__.py <- Makes src a Python module\n\u2502 \u2502\n\u2502 \u251c\u2500\u2500 models <- model implementations, training script\n\u2502 \u2502 \u251c\u2500\u2500 __init__.py\n\u2502 \u2502 \u251c\u2500\u2500 model.py\n\u2502 \u2502 \u251c\u2500\u2500 train_model.py\n...\n
It is not important exactly what is in the src
folder for the exercises, but we are going to refer to the above structure, so adjust accordingly if you deviate from it. Additionally, we are going to assume that your project code is installed in your environment such that it can be imported as normal Python code.
We are going to need two Python packages to get started: mkdocs and material for mkdocs. Install them with
pip install \"mkdocs-material >= 4.8.0\" # (1)!\n
Since mkdocs
is a dependency of mkdocs-material
, we only need to install the latter. Run in your terminal (from the docs
folder):
mkdocs serve # (1)!\n
mkdocs serve
will automatically rebuild the whole site whenever you save a file inside the docs
folder. This is not a problem if you have a fairly small site with not that many pages (or elements), but it can take a long time for large sites. Consider running with the --dirty
option to only rebuild the files that have changed. Running the serve command should render the index.md
file as the homepage. You can leave the documentation server running during the remaining exercises.
We are now ready to document the API of our code:
Make sure you at least have one function and class inside your src
module. If you do not, you can for simplicity copy the following module into the src/models/model.py
file
import torch\n\nclass MyNeuralNet(torch.nn.Module):\n    \"\"\"Basic neural network class.\n\n    Args:\n        in_features: number of input features\n        out_features: number of output features\n\n    \"\"\"\n    def __init__(self, in_features: int, out_features: int) -> None:\n        super().__init__()  # initialize the parent nn.Module class\n        self.l1 = torch.nn.Linear(in_features, 500)\n        self.l2 = torch.nn.Linear(500, out_features)\n        self.r = torch.nn.ReLU()\n\n    def forward(self, x: torch.Tensor) -> torch.Tensor:\n        \"\"\"Forward pass of the model.\n\n        Args:\n            x: input tensor expected to be of shape [N,in_features]\n\n        Returns:\n            Output tensor with shape [N,out_features]\n\n        \"\"\"\n        return self.l2(self.r(self.l1(x)))\n
and add the following function to the src/predict_model.py
file:
import torch\n\n\ndef predict(\n    model: torch.nn.Module,\n    dataloader: torch.utils.data.DataLoader,\n) -> torch.Tensor:\n    \"\"\"Run prediction for a given model and dataloader.\n\n    Args:\n        model: model to use for prediction\n        dataloader: dataloader with batches\n\n    Returns:\n        Tensor of shape [N, d] where N is the number of samples and d is the output dimension of the model\n\n    \"\"\"\n    return torch.cat([model(batch) for batch in dataloader], 0)\n
Add a markdown file to the docs/source
folder called my_api.md
and add that file to the nav:
section in the mkdocs.yml
file.
To that file add the following code:
# My API\n\n::: src.models.model.MyNeuralNet\n\n::: src.predict_model.predict\n
The :::
indicator tells mkdocs that it should look up the corresponding function/module and render it on the given page. Thus, if your functions/modules are located elsewhere, change the paths accordingly.
Make sure that the documentation correctly includes your function and module on the given page.
(Optional) Include more functions/modules in your documentation.
(Optional) Look through the documentation for mkdocstrings and try to improve the layout a bit. Especially, the headings, docstrings and signatures could be of interest to adjust.
Finally, try to build a final version of your documentation
mkdocs build\n
this should result in a site
folder that contains the actual HTML code for documentation.
To publish your documentation you need a place to host your built documentation, e.g. the content of the site
folder you built in the last exercise. There are many places to host your documentation, but if you only need a static site and are already hosting your code through Github, then a good option is Github Pages, which is free to use for public projects.
Before getting started with this set of exercises you should have completed module M16 on github actions so you already know about workflow files.
"},{"location":"s10_extra/documentation/#exercises_1","title":"Exercises","text":"Start by adding a new file called deploy_docs.yaml
to the .github/workflows
folder. Add the following code to that file and save it.
name: Deploy docs\n\non:\n  push:\n    branches:\n      - main\n\npermissions:\n  contents: write # (1)\n\njobs:\n  deploy:\n    runs-on: ubuntu-latest\n    steps:\n      - uses: actions/checkout@v3\n        with:\n          fetch-depth: 0\n      - uses: actions/setup-python@v4\n        with:\n          python-version: \"3.10\"\n      - uses: actions/cache@v2\n        with:\n          key: ${{ github.ref }}\n          path: .cache\n      - run: pip install -r requirements.txt\n      - run: mkdocs gh-deploy --force\n
write
permissions are needed for this action because it is not only reading your code but also pushing code. Before continuing, make sure you understand what the different steps of the workflow do; we especially recommend looking at the documentation of the mkdocs gh-deploy
command.
Commit and push the file. Check that the action is executed and, if it succeeds, that your built project is pushed to a branch called gh-pages
. If the action does not succeed, then figure out what is wrong and fix it!
After confirming that our action is working, you need to configure Github to actually publish the content being built by Github Actions. Do the following:
Source
setting choose the Deploy from a branch
Branch
setting choose the gh-pages
branch and /(root)
folder and save
This should then start deploying your site to https://<your-username>.github.io/<your-reponame>/
. If it does not, you may need to recommit and trigger the Github Actions build again.
Make sure your documentation is published and looks as it should.
This ends the module on technical documentation. We cannot stress enough how important it is to write proper documentation for larger projects that need to be maintained over a longer time. It is an iterative process, but it is often best to do it while you are writing the code.
"},{"location":"s10_extra/frontend/","title":"Frontend","text":"Danger
Module is still under development
"},{"location":"s10_extra/frontend/#streamlit","title":"Streamlit","text":"steamlit
streamlit
pip install streamlit\n
and run streamlit hello
afterwards to check that everything works as expected.
As discussed in the intro session on the cloud, cloud providers offer near infinite compute resources. However, using these resources often comes at a hefty price, and it is therefore important to be aware of another resource many have access to: High Performance Clusters or HPC. HPCs exist all over the world, and many times you already have access to one or can easily get access to one. If you are a university student you most likely have a local HPC that you can access through your institution. Otherwise, there exist public HPC resources that everybody (with a project) can apply for. As an example, in the EU we have the EuroHPC initiative that currently has 8 different supercomputers, with a centralized location for applying for resources that is open to both research projects and start-ups.
Depending on your application, you may have different needs, and it is therefore important to also be aware of the different tiers of HPC. In Europe, HPC is often categorized such that Tier-0 are European centers with petaflop or exascale machines, Tier 1 are national centers of supercomputers, and Tier 2 are regional centers. The lower the tier, the larger the applications it is possible to run.
Image credit"},{"location":"s10_extra/high_performance_clusters/#cluster-architectures","title":"Cluster architectures","text":"In very general terms, clusters come as two different kinds of systems: supercomputers and LSF (Load Sharing Facility) systems. A supercomputer (as shown below) is organized into different modules that are separated by network links. When you log in to a supercomputer you will meet the front end, which contains all the software needed to run computations. When you submit a job it will get sent to the backend modules, which in most cases include: general compute modules (CPU), acceleration modules (GPU), a memory module (RAM) and finally a storage module (HDD). Depending on your application you may need one module more than another. For example, in deep learning the acceleration module is important, but in physics simulations the general compute module / storage module is probably more important.
Overview of the Meluxina supercomputer that's part of EuroHPC. Image credit Alternatively, an LSF system is a network of computers where each computer has its own CPU, GPU, RAM etc. and the individual computers (or nodes) are then connected by network. The important difference between a supercomputer and an LSF system is how the resources are organized. When comparing the two, it is generally better to run on an LSF system if you are only requesting resources that can be handled by a single node, whereas it is better to run on a supercomputer if you have a resource-intensive application that requires many devices to communicate with each other.
Regardless of the cluster architecture, on the software side of HPC the most important part is what's called the HPC scheduler. Without an HPC scheduler, an HPC cluster would just be a bunch of servers with different jobs interfering with each other. The problem is that when you have a large collection of resources and a large collection of users, you cannot rely on the users just running their applications without interfering with each other. An HPC scheduler makes sure that whenever a user requests to run an application, the request gets put in a queue, and whenever the resources the application asks for become available, the application gets run.
The biggest batch control systems for doing scheduling on HPC are:
We are going to take a look at PBS works as that is what is installed on our local university cluster.
"},{"location":"s10_extra/high_performance_clusters/#exercises","title":"\u2754 Exercises","text":"Exercise files
The following exercises are focused on local students at DTU that want to use our local HPC resources. That said, the steps in the exercise are fairly general to other types of clusters. For the purpose of this exercise we are going to see how we can run this image classifier script, but feel free to work with whatever application you want to.
Start by accessing the cluster. This can either be through ssh
in a terminal or, if you want a graphical interface, thinlinc can be installed. In general we recommend that DTU students follow the steps here, as the setup depends on whether you are on campus or not.
When you have access to the cluster we are going to start with the setup phase, where we set up the environment necessary for our computations. If you have accessed the cluster through the graphical interface, start by opening a terminal.
Let's start by setting up conda for controlling our dependencies. If you have not already worked with conda
, please check out module M2 on package managers and virtual environments. In general you should be able to set up (mini)conda through these two commands:
wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh\nsh Miniconda3-latest-Linux-x86_64.sh\n
Close the terminal and open a new one for the installation to complete. Type conda
in the terminal to check that everything is fine. Go ahead and create a new environment that we can install dependencies in
conda create -n \"hpc_env\" python=3.10 --no-default-packages\n
and activate it.
Copy over any files you need. For the image classifier script you need the requirements file and the actual application.
Next, install all the requirements you need. If you want to run the image classifier script you can run this command in the terminal
pip install -r image_classifier_requirements.txt\n
using this requirements file.
That's all the setup needed. You would need to go through the creation of an environment and installation of requirements whenever you start a new project (no need to reinstall conda). For the next step we need to look at how to submit jobs on the cluster. We are now ready to submit our first job to the cluster:
Start by checking the statistics for the different clusters. Try to use both the qstat
command, which should give an overview of the different clusters, the number of running jobs and the number of pending jobs. For many systems you can also try the much more user-friendly classstat
command.
Figure out which queue you want to use. For the sake of the exercises it needs to be one with GPU support. For DTU students, any queue that starts with gpu
is GPU accelerated.
Now we are going to develop a bash script for submitting our job. We have provided an example of such a script. Take a careful look, go through each line and make sure you understand it. Afterwards, change it to your needs (queue and student email).
Try to submit the script:
bsub < jobscript.sh\n
You can check the status of your script by running the bstat
command. Hopefully, the job should go through really quickly. Take a look at the output file, it should be called something like gpu_*.out
. Also take a look at the gpu_*.err
file. Do both files look as they should?
Let's now try to run our application on the cluster. To do that we need to take care of two things:
First we need to load the correct version of CUDA. A cluster system often contains multiple versions of specific software to suit the needs of all its users, and it is the user who is in charge of loading the correct software during job submission. The only extra software that needs to be loaded for most Pytorch applications is a CUDA module. You can check which modules are available on the cluster with
module avail\n
Afterwards, add the correct CUDA version you need to the jobscript.sh
file. If you are trying to run the provided image classifier script then the correct version is CUDA/11.7
(can be seen in the requirements file).
# add to the bottom of the file\nmodule load cuda/11.7\n
We are now ready to add in our application. The only thing we need to take care of is telling the system to run it using the python
version that is connected to our hpc_env
environment we created in the beginning. Try typing:
which python\n
which should give you the full path. Then add to the bottom of the jobscript
file:
~/miniconda3/envs/hpc_env/bin/python \\\n image_classifier.py \\\n --trainer.accelerator 'gpu' --trainer.devices 1 --trainer.max_epochs 5\n
which will run the image classifier script (change it if you are running something else).
Finally submit the job:
bsub < jobscript.sh\n
and check when it is done that it has produced what you expected.
(Optional) If your application supports multiple GPUs, also try that out. You would first need to change the jobscript to request multiple GPUs and additionally you would need to tell your application to run on multiple GPUs. For the image classifier script it can be done by changing the --trainer.devices
flag to 2
(or higher).
This ends the module on using HPC systems.
"},{"location":"s10_extra/hyperparameters/","title":"M32 - Hyperparameter optimization","text":""},{"location":"s10_extra/hyperparameters/#hyperparameter-optimization","title":"Hyperparameter optimization","text":"Hyperparameter optimization is not a new idea within machine learning but has seen somewhat of a renaissance with the rise of deep learning. This can mainly be attributed to the following:
However, the problem with doing hyperparameter optimization of deep learning models is that it can take over a week to train a single model. In most cases we therefore cannot do a full grid search of all hyperparameter combinations to get the best model. Instead we have to use some tricks that will help us speed up the search. In these exercises we are going to integrate optuna into our different models, which will provide the tools for speeding up our search.
It should be noted that a lot of deep learning models do not optimize every hyperparameter that is included in the model but instead rely on heuristic guidelines (\"rules of thumb\") based on what seems to work in general, e.g. a learning rate of 0.01 seems to work great with the Adam optimizer. That said, these rules probably only apply to 80% of deep learning models, whereas for the last 20% the recommendations may be suboptimal. Here is a great site that has collected an extensive list of these recommendations, taken from the excellent deep learning book by Ian Goodfellow, Yoshua Bengio and Aaron Courville.
In practice, I recommend trying to identify (through experimentation) which hyperparameters are important for the performance of your model and then spending your computational budget on optimizing them, while setting the rest to a \"recommended value\".
"},{"location":"s10_extra/hyperparameters/#exercises","title":"\u2754 Exercises","text":"Exercise files
Start by installing optuna: pip install optuna
Initially we will look at the cross_validate.py
file. It implements simple K-fold cross validation of a random forest on the sklearn digits dataset (a subset of MNIST). Look over the script and try to run it; a rough sketch of what such a script does is shown below.
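If you want a feeling for what such a script does before touching Optuna, here is a rough sketch (not the provided cross_validate.py, just an illustration using standard sklearn functionality) of K-fold cross validation of a random forest on the digits dataset:
from sklearn.datasets import load_digits\nfrom sklearn.ensemble import RandomForestClassifier\nfrom sklearn.model_selection import cross_val_score\n\n# load the digits dataset (a small 8x8 version of MNIST)\nx, y = load_digits(return_X_y=True)\n\n# a random forest with a fixed set of hyperparameters\nmodel = RandomForestClassifier(n_estimators=100, max_depth=10)\n\n# 5-fold cross validation, returning one accuracy score per fold\nscores = cross_val_score(model, x, y, cv=5)\nprint(f\"Mean accuracy: {scores.mean():.3f} +/- {scores.std():.3f}\")\n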
We will now try to write the same code in optuna. Please note that the script has a variable OPTUNA=False
that you can use to change what part of the code should run. The three main concepts of optuna are
A trial: a single experiment
A study: a collection of trials
The objective: function to determine how \"good\" a trial is
Let's start by writing the objective function, which we have already started in the script. For now you do not need to care about the trial
argument, just assume that it contains the hyperparameters needed to define your random forest model. The output of the objective function should be a single number that we want to optimize. (HINT: did you remember to do K-fold cross validation inside your objective function?)
Next, let's focus on the trial. Inside the objective
function the trial should be used to suggest what parameters to use next. Take a look at the documentation for trial or take a look at the code examples and figure out how to define the hyperparameters of the model; a minimal sketch is shown below.
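To make the previous two points concrete, here is a minimal sketch of what an objective function could look like; the hyperparameters and ranges are just placeholders that you should adapt to the actual script:
import optuna\nfrom sklearn.datasets import load_digits\nfrom sklearn.ensemble import RandomForestClassifier\nfrom sklearn.model_selection import cross_val_score\n\n\ndef objective(trial: optuna.Trial) -> float:\n    \"\"\"Cross-validate a random forest with hyperparameters suggested by the trial.\"\"\"\n    # the trial suggests a value within the given range for each hyperparameter\n    n_estimators = trial.suggest_int(\"n_estimators\", 10, 200)\n    max_depth = trial.suggest_int(\"max_depth\", 2, 32)\n\n    x, y = load_digits(return_X_y=True)\n    model = RandomForestClassifier(n_estimators=n_estimators, max_depth=max_depth)\n\n    # K-fold cross validation inside the objective, returning the mean accuracy\n    return cross_val_score(model, x, y, cv=5).mean()\n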
Finally, let's launch a study. It can be as simple as
study = optuna.create_study()\nstudy.optimize(objective, n_trials=100)\n
but let's play around a bit with it:
By default the .optimize
method will minimize the objective (by definition the optimum of an objective function is at its minimum). Is the score your objective function is returning something that should be minimized? If not, a simple solution is to put a -
in front of the metric. However, look through the documentation on how to change the direction of the optimization.
Optuna will by default do Bayesian optimization when sampling the hyperparameters (using an evolutionary algorithm for suggesting new trials). However, since this example is quite simple, we can actually perform a full grid search. How would you do this in Optuna? A sketch of both changing the optimization direction and doing grid search is shown after this list.
Compare the performance of a single optuna run using Bayesian optimization with n_trials=10
with an exhaustive grid search that has searched through all hyperparameters. What is the performance/time trade-off of these two solutions?
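As a hint for the last points above, both the direction of the optimization and grid search sampling can be controlled when creating the study. A minimal sketch, reusing the objective function sketched earlier and assuming it returns an accuracy that should be maximized:
import optuna\n\n# maximize instead of the default minimize\nstudy = optuna.create_study(direction=\"maximize\")\nstudy.optimize(objective, n_trials=10)\n\n# alternatively, perform an exhaustive grid search with the GridSampler\nsearch_space = {\"n_estimators\": [10, 50, 100, 200], \"max_depth\": [2, 8, 16, 32]}\ngrid_study = optuna.create_study(direction=\"maximize\", sampler=optuna.samplers.GridSampler(search_space))\ngrid_study.optimize(objective, n_trials=16)  # one trial per grid combination\n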
In addition to doing Bayesian optimization, the other great part about Optuna is that it has native support for pruning unpromising trials. Pruning refers to stopping trials for hyperparameter combinations that do not seem to lead anywhere. You may have a learning rate that is so high that training is diverging, or a neural network with so many parameters that it is just overfitting to the training data. This however begs the question: what constitutes an unpromising trial? This is up to you to define based on prior experimentation.
Start by looking at the fashion_trainer.py
script. It's a simple classification network for classifying images in the FashionMNIST dataset. Run the script with the default hyperparameters to get a feeling of how the training should progress. Note down the performance on the test set.
Start by defining a validation set and a validation dataloader that we can use for hyperparameter optimization (HINT: use 5-10% of your training data); a minimal sketch is shown below.
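A minimal sketch of how such a split could look; the dummy dataset is just a stand-in for your actual training data:
import torch\nfrom torch.utils.data import DataLoader, TensorDataset, random_split\n\n# dummy stand-in for your actual training data\ntrain_set = TensorDataset(torch.randn(1000, 1, 28, 28), torch.randint(0, 10, (1000,)))\n\n# hold out 10% of the training data for validation\nn_val = int(0.1 * len(train_set))\ntrain_subset, val_subset = random_split(train_set, [len(train_set) - n_val, n_val])\n\ntrain_loader = DataLoader(train_subset, batch_size=64, shuffle=True)\nval_loader = DataLoader(val_subset, batch_size=64)\n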
Now, adjust the script to use Optuna. The 5 hyperparameters listed in the table above should at least be included in the hyperparameter search. For some we have already defined the search space, but for the remaining ones you need to come up with a good range of values to investigate. When done integrating Optuna, run a small study (n_trials=3
) to check that the code is working.
nn.ReLU
, nn.Tanh
, nn.RReLU
, nn.LeakyReLU
, nn.ELU
} If implemented correctly, the number of hyperparameter combinations should be at least 1000, meaning that we not only need Bayesian optimization but probably also need pruning to succeed. Check out the page for built-in pruners in Optuna and implement pruning in the script; a minimal sketch of how pruning hooks into the objective function is shown below. I recommend using either the MedianPruner
or the PercentilePruner
.
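Here is a minimal sketch of how pruning hooks into the objective function; the epoch loop and the validation accuracy are placeholders that you need to replace with your own training and evaluation code:
import optuna\n\n\ndef objective(trial: optuna.Trial) -> float:\n    \"\"\"Objective that reports intermediate values so unpromising trials can be pruned.\"\"\"\n    val_accuracy = 0.0\n    for epoch in range(10):\n        # ... train for one epoch and compute the validation accuracy here ...\n        val_accuracy = 0.5  # placeholder value\n\n        # report the intermediate value so the pruner can act on it\n        trial.report(val_accuracy, step=epoch)\n        if trial.should_prune():\n            raise optuna.TrialPruned()\n    return val_accuracy\n\n\nstudy = optuna.create_study(direction=\"maximize\", pruner=optuna.pruners.MedianPruner(n_warmup_steps=2))\nstudy.optimize(objective, n_trials=50)\n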
Re-run the study using pruning with a large number of trials (n_trials>50
)
Take a look at this visualization page for ideas on how to visualize the study you just did. Make at least two visualizations of the study and make sure that you understand them.
Pruning is great for better spending your computational budget, however it comes with a trade-off. What is it and what hyperparameter should one be especially careful about when using pruning?
Finally, what parameter combination achieved the best performance? What is the test set performance for this set of parameters? Did you improve over the initial set of hyperparameters?
The exercises until now have focused on doing the hyperparameter search sequentially, meaning that we test one set of parameters at a time. It is a fine approach because you can easily let it run for a week without any interaction. However, assuming that you have the computational resources to run in parallel, how do you do that?
To run hyperparameter search in parallel we need a common database that all experiments can read and write to. We are going to use the recommended mysql
. You do not have to understand what SQL is to complete this exercise, but it is basically a language (like Python) for managing databases. Install mysql.
Next we are going to initialize a database that we can read and write to. For this exercise we are going to focus on a locally stored database, but it could of course also be located in the cloud.
mysql -u root -e \"CREATE DATABASE IF NOT EXISTS example\"\n
you can also do this directly in Python when calling the create_study
command by also setting the storage
and load_if_exists=True
flags.
Now we are going to create an Optuna study in our database
optuna create-study --study-name \"distributed-example\" --storage \"mysql://root@localhost/example\"\n
Change how you initialize the study to read and write to the database. Therefore, instead of doing
study = optuna.create_study()\n
then do
study = optuna.load_study(\n study_name=\"distributed-example\", storage=\"mysql://root@localhost/example\"\n)\n
where the study_name
and storage
should match how the study was created.
For running in parallel, you can either open up an extra terminal and simply launch your script once per open terminal, or you can use the provided parallel_lancher.py
that will launch multiple executions of your script. It should be used as:
python parallel_lancher.py myscript.py --num_parallel 2\n
Finally, make sure that you can access the results.
That's all on how to do hyperparameter optimization in a scalable way. If you feel like it, you can try to apply these techniques to the ongoing corrupted MNIST example, where you are free to choose what hyperparameters you want to use.
"},{"location":"s10_extra/kubernetes/","title":"Kubernetes","text":"Danger
Module is still under development
"},{"location":"s10_extra/kubernetes/#kubernetes","title":"Kubernetes","text":"Kubernetes, also known as K8s, is an open-source platform designed to automate the deployment, scaling, and operation of application containers across clusters of hosts. It provides the framework to run distributed systems resiliently, handling scaling and failover for your applications, providing deployment patterns, and more.
"},{"location":"s10_extra/kubernetes/#what-is-kubernetes","title":"What is Kubernetes?","text":""},{"location":"s10_extra/kubernetes/#brief-history","title":"Brief History","text":"Kubernetes was originally developed by Google and is now maintained by the Cloud Native Computing Foundation.
"},{"location":"s10_extra/kubernetes/#core-functions","title":"Core Functions","text":"Kubernetes makes it easier to deploy and manage containerized applications at scale.
"},{"location":"s10_extra/kubernetes/#key-concepts","title":"Key Concepts","text":"Kubernetes follows a client-server architecture. At a high level, it consists of a Control Plane (master) and Nodes (workers).
Overview of Kubernetes Architecture. Image Credit: Kubernetes Official Documentation"},{"location":"s10_extra/kubernetes/#control-plane-components","title":"Control Plane Components","text":"Minikube is a tool that allows you to run Kubernetes locally. It runs a single-node Kubernetes cluster inside a VM on your laptop for users looking to try out Kubernetes or develop with it day-to-day.
"},{"location":"s10_extra/kubernetes/#installing-minikube","title":"Installing Minikube","text":"minikube start
.minikube
in a terminal.kubectl
in a terminal.Yatai is a model serving platform, making it easier to deploy machine learning models in Kubernetes environments.
"},{"location":"s10_extra/kubernetes/#what-is-yatai","title":"What is Yatai?","text":"Yatai simplifies the deployment, management, and scaling of machine learning models in Kubernetes.
"},{"location":"s10_extra/kubernetes/#getting-started-with-yatai","title":"Getting Started with Yatai","text":"Danger
Module is still under development
"},{"location":"s10_extra/onnx/#model-packaging","title":"Model packaging","text":"Whenever we want to serve an machine learning model, what we are actually interested in is doing predictions e.g. given a new datapoint we pass it through our model (forward pass) and the returned value is the predicted value of that datapoint. At a high-level, model predictions depends on three things:
We have already in module M9 on Docker touch on how to take care of all these things. Containers makes it easy to link a codebase, model weights and code dependencies into a single object. We in general can refer to this as model packaging, because as the name suggest, we are packaging our model into a format that is independent of the actual environment that we are trying to run the model in.
However, containers is not the only way to do model packaging. If we put some light restrictions on the device we want run our model predictions on, we can achieve the same result using ONNX. The Open Neural Network Exchange (ONNX) is a standardized format for creating and sharing machine learning models. ONNX provides an open source format for machine learning models, both deep learning and traditional ML. It defines an extensible computation graph model, as well as definitions of built-in operators and standard data types.
Image creditAs the above image indicates, the idea behind ONNX is that a model trained with a specific framework on a specific device, lets say Pytorch on your local computer, can be exported and run with an entirely different framework and hardware easily. For example, not all frameworks are created equally. For example Pytorch is in general considered an developer friendly framework, however it has historically been slow to run inference with compared to a framework such as Caffe2. ONNX allow you to mix-and-match frameworks based on different usecases, and essentially increases the longivity of your model.
"},{"location":"s10_extra/onnx/#exercises","title":"\u2754 Exercises","text":"Start by installing ONNX:
pip install onnx\npip install onnxruntime\n
the first package includes the basic building blocks for implementing generalized ONNX models and the second package is for running ONNX optimal on different hardware.
As an test that your installation is working, try executing the following python code
import onnxruntime\nonnxruntime.get_all_providers()\n
these providers are translation layers that are implemented ONNX, such that the same ONNX model can run on completely different hardware. Can you identify at least two of the providers that are necessary for running standard Pytorch code on CPU and GPU? Can you identify others
One big advantage of having a standardized format, is that we can easily visualize the computational graph of our model because it consist only of core ONNX operations. We are here going to use the open-source tool netron for visualization. You can either choose to download the program or just run it in your webbrowser.
Danger
Module is still under development
Image credit"},{"location":"s10_extra/pipeline/#dags","title":"DAGs","text":"Directed Acyclic Graph (DAG)
"},{"location":"s10_extra/pipeline/#exercises","title":"\u2754 Exercises","text":"Start by installing prefect
:
pip install prefect\n
Start a local Prefect server instance in your virtual environment.
prefect server start\n
The great thing about Prefect is that the orchestration tasks and flows are written in pure Python.
Slides
Today we start our journey into the world of machine learning operations (MLOps). However, before we can really get started, we need to make sure that you have a basic understanding of a couple of topics, as we will be using these throughout the course. In particular, today is all about getting set up with a proper development environment that can support your journey. Most of you probably already have experience with these topics, and it will be mostly repetition.
The reason we are starting here is that many students are missing very basic skills that are never taught but are just expected to be picked up by yourself. This session will only cover the most basic skills to get you started on your development journey. If you wish to learn more about basic computer science skills in general, we highly recommend that you check out The Missing Semester of Your CS Education course from MIT.
Learning objectives
The learning objectives of this session are:
Core Module
Image creditContrary to popular belief, the command line (also commonly known as the terminal) is not a mythical being that has existed since the dawn of time. Instead, it was created at a time when it was not given that your computer had a graphical interface that you could interact with. Think of it as a text interface to your computer.
The terminal is a well-known concept to users of Linux, however, MAC and (especially) Windows users often do not need and therefore encounter it. Having a basic understanding of how to use a command line can help improve your workflow. The reason that the command line is an important tool to get to know, is that doing any kind of MLOps will require us to be able to interact with many different tools, many of which do not have a graphical interface. Additionally, when we get to working in the cloud later in the course, you will be forced to interact with the command line.
Note if you already are a terminal wizard then feel free to skip the exercises below. They are very elementary.
"},{"location":"s1_development_environment/command_line/#the-anatomy-of-the-command-line","title":"The anatomy of the command line","text":"Regardless of the operating system, all command lines look more or less the same:
As already stated, it is essentially just a big text interface to interact with your computer. As the image illustrates, when trying to execute a command, there are several parts to it:
$
, >
, :
are the usual ones. It can also contain other information, such as in the case of the above image which also shows the current conda
environment.ls
or cd
ls -l
or cd ..
.ls -l figures
or cd ..
.The core difference between options and arguments is that options are optional, while arguments are not.
Image credit"},{"location":"s1_development_environment/command_line/#exercises","title":"\u2754 Exercises","text":"We have put a cheat sheet in the exercise files folder belonging to this session, that gives a quick overview of the different commands that can be executed in the command line.
Windows usersWe highly recommend that you install Windows Subsystem for Linux (WSL). This will install a full Linux system on your Windows machine. Please follow this guide. Remember to run commands from an elevated (as administrator) Windows Command Prompt. You can in general complete all exercises in the course from a normal Windows Command Prompt, but some are easier to do if you run from WSL.
If you decide to run in WSL you need to remember that you now have two different systems, and installing a package on one system does not mean that it is installed on the other. For example, if you install pip
in WSL, you need to install it again in Windows if you want to use it there.
If you decide to not run in WSL, please always work in a Windows Command Prompt and not Powershell.
Start by opening a terminal.
To navigate inside a terminal, we rely on the cd
command and pwd
command. Make sure you know how to go back and forth in your file system. (1)
The ls
command is important when we want to know the content of a folder. Try to use the command, and also try it with the additional option -l
. What does it show?
Make sure to familiarize yourself with the which
, echo
, cat
, wget
, less
and top
commands. Also, familiarize yourself with the >
operator. You are probably going to use some of them throughout the course or in your future career. For Windows users, these commands may be named something else, e.g. where
command on Windows corresponds to which
.
It is also significant that you know how to edit a file through the terminal. Most systems should have the nano
editor installed, else try to figure out which one is installed in your system.
Type nano
in the terminal
Write the following text in the script
if __name__ == \"__main__\":\n print(\"Hello world!\")\n
Save the script and try to execute it
Afterward, try to edit the file through the terminal (change Hello world
to something else)
All terminals come with their own programming language. The most common system is called bash
. It can come in handy being able to write simple programs in bash. For example, one case is that you want to execute multiple Python programs sequentially, which can be done through a bash script.
Bash is not part of Windows, so you need to run this part through WSL. If you did not install WSL, you can skip this part or as an alternative do the exercises in Powershell which is the native Windows scripting language (not recommended).
Write a bash script (in nano
) and try executing it:
#!/bin/bash\n# A sample Bash script, by Ryan\necho Hello World!\n
Change the bash script to call the Python program you just wrote.
Try to Google how to write a simple for-loop that executes the Python script 10 times in a row.
Here is one command from later in the course when we are going to work in the cloud
gcloud compute instances create-with-container instance-1 \\\n --container-image=gcr.io/<project-id>/gcp_vm_tester\n --zone=europe-west1-b\n
Identify the command, options and arguments.
Solutiongcloud compute instances create-with-container
.--container-image=gcr.io/<project-id>/gcp_vm_tester
and --zone=europe-west1-b
.instance-1
.The tricky part of this example is that commands can have subcommands, which are also commands. In this case compute
is a subcommand to gcloud
, instances
is a subcommand to compute
and create-with-container
is a subcommand to instances
Two common arguments that nearly all commands have are the -h
and -V
options. What does each of them do?
The -h
(or --help
) option prints the help message for the command, including subcommands and arguments. Try it out by executing python -h
. The -V
(or --version
) option prints the version of the installed program. Try it out by executing python --version
.
This ends the module on the command line. If you are still not comfortable working with the command line, fear not as we are going to use it extensively throughout the course. If you want to spend additional time on this topic, we highly recommend that you watch this video on how to use the command line.
If you are interested in personalizing your command line, you can check out the starship project, which allows you to customize your command line with a lot of different options.
"},{"location":"s1_development_environment/deep_learning_software/","title":"M4 - Deep Learning Software","text":""},{"location":"s1_development_environment/deep_learning_software/#deep-learning-software","title":"Deep Learning Software","text":"Core Module
Deep learning has since its revolution back in 2012 transformed our lives. From Google Translate to driverless cars to personal assistants to protein engineering, deep learning is transforming nearly every sector of our economy and our lives. However, it did not take long before people realized that deep learning is not a simple beast to tame and it comes with its own kinds of problems, especially if you want to use it in a production setting. In particular the concept of technical debt was invented to indicate the significant maintenance costs at a system level that it takes to run machine learning in production. MLOps should very much be seen as the response to the concept of technical debt, namely that we should develop methods, processes and tools (with inspiration from classical DevOps) to counter the problems we run into when working with deep learning models.
It is important to note that all the concepts and tools that have been developed for MLOps can absolutely be used together with more classical machine learning models (think K-nearest neighbor, Random forest etc.), however deep learning comes with its own set of problems which mostly have to do with the sheer size of the data and models we are working with. For these reasons, we are focusing on working with deep learning models in this course.
"},{"location":"s1_development_environment/deep_learning_software/#software-landscape-for-deep-learning","title":"Software landscape for Deep Learning","text":"Regarding software for Deep Learning, the landscape is currently dominated by three software frameworks (listed in order of when they were published):
Tensorflow
Pytorch
JAX
We won't go into a longer discussion on which framework is best, as it is pointless. Pytorch and Tensorflow have been around for the longest and therefore have bigger communities and feature sets at this point in time. They are both very similar in the sense that they both have features directed against research and production. JAX is kind of the new kid on the block, which in many ways improves on Pytorch and Tensorflow, but is still not as mature as the other frameworks. As the frameworks use different kind of programming principles (object oriented vs. functional programming), comparing them is essentially meaningless.
In this course we have chosen to work with Pytorch, because we find it a bit more intuitive and it is the framework that we use for our day to day research life. Additionally, as of right now it is absolutely the dominating framework for published models, research papers and competition winners
The intention behind this set of exercises is to bring everyone's Pytorch skills up-to-date. If you already are a Pytorch-Jedi feel free to pass the first set of exercises, but I recommend that you still complete it. The exercises are in large part taken directly from the deep learning course at udacity. Note that these exercises are given as notebooks, which is the last time we are going to use them actively in course. Instead, after this set of exercises, we are going to focus on writing code in python scripts.
The notebooks contains a lot of explaining text. The exercises that you are supposed to fill out are inlined in the text in small \"exercise\" blocks:
If you need a fresh-up on any deep learning topic in general throughout the course, we recommend to find the relevant chapter in the deep learning book by Ian Goodfellow, Yoshua Bengio and Aaron Courville (can also be found in the literature folder). It is absolutely not necessary to be good at deep learning to pass this course as the focus is on all the software needed to get deep learning models into production. However, it is important to have a basic understanding of the concepts.
"},{"location":"s1_development_environment/deep_learning_software/#exercises","title":"\u2754 Exercises","text":"Exercise files
Start a jupyter notebook session in your terminal (assuming you are standing in the root of the course material). Alternatively you should be able to open the notebooks directly in your code editor. For VS code users you can read more about how to work with jupyter notebooks in VS code here
Complete the Tensors in Pytorch notebook. It focuses on basic manipulation of Pytorch tensors. You can pass this notebook if you are comfortable doing this.
Complete the Neural Networks in Pytorch notebook. It focuses on building a very simple neural network using the Pytorch nn.Module
interface.
Complete the Training Neural Networks notebook. It focuses on how to write a simple training loop for training a neural network.
Complete the Fashion MNIST notebook, that summaries concepts learned in the notebook 2 and 3 on building a neural network for classifying the Fashion MNIST dataset.
Complete the Inference and Validation notebook. This notebook adds important concepts on how to do inference and validation on our neural network.
Complete the Saving_and_Loading_Models notebook. This notebook addresses how to save and load model weights. This is important if you want to share a model with someone else.
If tensor a
has shape [N, d]
and tensor b
has shape [M, d]
how can we calculate the pairwise distance between rows in a
and b
without using a for loop?
We can take advantage of broadcasting to do this
a = torch.randn(N, d)\nb = torch.randn(M, d)\ndist = torch.sum((a.unsqueeze(1) - b.unsqueeze(0))**2, dim=2) # shape [N, M]\n
What should be the size of S
for an input image of size 1x28x28, and how many parameters does the neural network then have?
from torch import nn\nneural_net = nn.Sequential(\n nn.Conv2d(1, 32, 3), nn.ReLU(), nn.Conv2d(32, 64, 3), nn.ReLU(), nn.Flatten(), nn.Linear(S, 10)\n)\n
Solution Since both convolutions have a kernel size of 3, stride 1 (default value) and no padding that means that we lose 2 pixels in each dimension, because the kernel can not be centered on the edge pixels. Therefore, the output of the first convolution would be 32x26x26. The output of the second convolution would be 64x24x24. The size of S
must therefore be 64 * 24 * 24 = 36864
. The number of parameters in a convolutional layer is kernel_size * kernel_size * in_channels * out_channels + out_channels
(last term is the bias) and the number of parameters in a linear layer is in_features * out_features + out_features
(last term is the bias). Therefore, the total number of parameters in the network is 3*3*1*32 + 32 + 3*3*32*64 + 64 + 36864*10 + 10 = 387,466
, which could be calculated by running:
sum([prod(p.shape) for p in neural_net.parameters()])\n
A working training loop in Pytorch should have these three function calls: optimizer.zero_grad()
, loss.backward()
, optimizer.step()
. Explain what would happen in the training loop (or implement it) if you forgot each of the function calls.
optimizer.zero_grad()
is in charge of zeroring the gradient. If this is not done, then gradients would accumulate over the steps leading to exploding gradients. loss.backward()
is in charge of calculating the gradients. If this is not done, then the gradients would not be calculated and the optimizer would not be able to update the weights. optimizer.step()
is in charge of updating the weights. If this is not done, then the weights would not be updated and the model would not learn anything.
As the final exercise we will develop a simple baseline model which we will continue to develop on during the course. For this exercise we provide the data in the data/corruptmnist
folder. Do NOT use the data in the corruptmnist_v2
folder as that is intended for another exercise. As the name suggest this is a (subsampled) corrupted version of regular MNIST. Your overall task is the following:
Implement a MNIST neural network that achieves at least 85 % accuracy on the test set.
Before any training can start, you should identify what corruption that we have applied to the MNIST dataset to create the corrupted version. This can help you identify what kind of neural network to use to get good performance, but any network should really be able to achieve this.
One key point of this course is trying to stay organized. Spending time now organizing your code, will save time in the future as you start to add more and more features. As subgoals, please fulfill the following exercises
Implement your model in a script called model.py
Implement your data setup in a script called data.py
. The data was saved using torch.save
, so to load it you should use torch.load
.
Saving the model
When saving the model, you should use torch.save(model.state_dict(), \"model.pt\")
and when loading the model you should use model.load_state_dict(torch.load(\"model.pt\"))
. If you do torch.save(model, \"model.pt\")
this can lead to problems when loading the model later on, as it will try to not only save the model weights but also the model definition. This can lead to problems if you change the model definition later on (which you most likely is going to do).
Implement training and evaluation of your model in main.py
script. The main.py
script should be able to take an additional subcommands indicating if the model should train or evaluate. It will look something like this:
python main.py train --lr 1e-4\npython main.py evaluate trained_model.pt\n
which can be implemented in various ways.
VS code and command line argumentsIf you try to execute the above code in VS code using the debugger (F5) or the build in run functionality in the upper right corner:
you will get an error message saying that you need to select a command to run e.g. main.py
either needs the train
or evaluate
command. This can be fixed by adding a lunch.json
to a specialized .vscode
folder in the root of the project. The lunch.json
file should look something like this:
{\n \"version\": \"0.2.0\",\n \"configurations\": [\n {\n \"name\": \"Python: Current File\",\n \"type\": \"python\",\n \"request\": \"launch\",\n \"program\": \"${file}\",\n \"args\": [\n \"train\",\n \"--lr\",\n \"1e-4\"\n ],\n \"console\": \"integratedTerminal\",\n \"justMyCode\": true\n }\n ]\n}\n
This will inform VS code that then we execute the current file (in this case main.py
) we want to run it with the train
command and additionally pass the --lr
argument with the value 1e-4
. You can read more about creating a lunch.json
file here. If you want to have multiple configurations you can add them to the configurations
list as additional dictionaries.
To start you off, a very basic version of each script is provided in the final_exercise
folder. We have already implemented some logic, especially to make sure you can easily run different subcommands in for step 4. If you are interested in how this is done you can checkout this optional module on defining command line interfaces (CLI). We additionally also provide an requirements.txt
with suggestion to what packages are necessary to complete the exercise.
As documentation that your model is actually working, when running in the train
command the script needs to produce a single plot with the training curve (training step vs training loss). When the evaluate
command is run, it should write the test set accuracy to the terminal.
It is part of the exercise to not implement in notebooks as code development in the real life happens in script. As the model is simple to run (for now) you should be able to complete the exercise on your laptop, even if you are only training on cpu. That said you are allowed to upload your scripts to your own \"Google Drive\" and then you can call your scripts from a Google Colab notebook, which is shown in the image below where all code is place in the fashion_trainer.py
script and the Colab notebook is just used to execute it.
Be sure to have completed the final exercise before the next session, as we will be building on top of the model you have created.
"},{"location":"s1_development_environment/editor/","title":"M3 - Editor","text":""},{"location":"s1_development_environment/editor/#editoride","title":"Editor/IDE","text":"Core Module
Notebooks can be great for testing out ideas, developing simple code and explaining and visualizing certain aspects of a codebase. Remember that Jupyter notebook was created with intention to \"...allows you to create and share documents that contain live code, equations, visualizations and narrative text.\" However, any larger machine learning project will require you to work in multiple .py
files and here notebooks will provide a suboptimal workflow. Therefore, to for truly getting \"work done\" you will need a good editor / IDE.
Many opinions exist on this matter, but for simplicity we recommend getting started with one of the following 3:
Editor Webpage Comment (Biased opinion) Spyder https://www.spyder-ide.org/ Matlab like environment that is easy to get started with Visual studio code https://code.visualstudio.com/ Support for multiple languages with fairly easy setup PyCharm https://www.jetbrains.com/pycharm/ IDE for python professionals. Will take a bit of time getting used toWe highly recommend Visual studio (VS) code if you do not already have a editor installed (or just want to try something new.). We therefore put additional effort into explaining VS code.
Below you see an overview of the vs code interface
Image creditThe main components of VS code are:
The action bar: VS code is not an editor meant for a single language and can do many things. One of the core reasons that VS code have become so popular is that custom plug-ins called extensions can be installed to add functionality to VS code. It is in the action bar that you can navigate between these different applications when you have installed them.
The side bar: The side bar has different functionality depending on what extension that you have open. In most cases, the side bar will just contain the file explorer.
The editor: This where you code is. VS code supports a number of layouts in the editor (one column, two column etc.). You can make a custom layout by dragging a file to where you want the layout to split.
The panel: The panel contains a terminal for you to interact with. This can quickly be used to try out code by opening a python
interpreter, management of environments etc.
The status bar: The status bar contains information based on the extensions that you have installed. In particular for python development, the status bar can be used to change conda environment.
The overall goal of the exercises, is that you should start familiarizing yourself with the editor that you have chosen. If you are already an expert in one of them, feel free to skip the rest. You should at least be able to:
The instructions below are specific to Visual studio code but we recommend that you try to answer the questions if using another editor. In the exercise_files
folder belonging to this session we have put cheat sheets for VS code (one for Windows and one for Mac/Linux), that can give you an easy overview of the different macros in VS code. The following exercises are just to get you started but you can find many more tutorials here.
VS code is a general editor for many languages and to get proper python support we need to install some extensions. In the action bar
go to the extension
tap and search for python
in the marketplace. For here we highly recommend installing the following packages:
If you install the Python
package you should see something like this in your status bar:
which indicates that you are using the stock python installation, instead of the one you have created using conda
. Click it and change the python environment to the one you actually want to use.
One of the most useful tools in VS Code is the ability to navigate the whole project using the built-in Explorer
. To really take advantage of the VS code you need to make sure what you are working on is a project. Create a folder called hello
(somewhere on your laptop) and open it in VS Code (Click File
in the menu and then select Open Folder
). You should end up with a completely clean workspace (as shown below). Click the New file
button and create a file called hello.py
.
Image credit
Finally, lets run some code. Add something simple to the hello.py
file like:
Image credit
and click the run
button as shown in the image. It should create a new terminal, activate the environment that you have chosen and finally run your script. In addition to clicking the run
button, you can also
Shift+Enter
to run it in the terminalThat's, the basic of using VS code. We recommend highly that you revisit this tutorial during the course when we get to topics such as debugging and version control which VS code can help with.
"},{"location":"s1_development_environment/editor/#a-note-on-jupyter-notebooks-in-production-environments","title":"A note on jupyter notebooks in production environments","text":"As already stated jupyter notebooks are great for development as they allow developers to easily test our new ideas. However, they often lead to pain points when models actually need to be deployed. We highly recommend reading section 5.1.1 of this paper by Shankar et al. that in more detail discuss the strong opinions to jupyter notebooks that exist within the developer community.
All this said there at least exist one simple tool to make notebooks work better in a production setting. Its called nbconvert
and can be installed with
conda install nbconvert # or pip install nbconvert\n
You may need some further dependencies such as Pandoc, TeX and Pyppeteer for it to work (see install instructions here). After this, converting a notebook to a .py
script is a simple as:
jupyter nbconvert --to=script my_notebook.ipynb\n
which will produce a similar named script called my_notebook.py
. We highly recommend that you stick to developing scripts directly during the course to get experience with doing so, but nbconvert
can be an fantastic tool to have in your toolbox.
Core Module
Python is a great programming language and this is mostly due to its vast ecosystem of packages. No matter what you want to do, there is probably a package that can get you started. Just try to remember when the last time you wrote a program only using the python standard library? Probably never. For this reason, we need a way to install third-party packages and this is where package managers come into play.
You have probably already used pip
for the longest time, which is the default package manager for Python. pip
is great for beginners but it is missing one essential feature that you will need as a developer or data scientist: virtual environments. Virtual environments are an essential way to make sure that the dependencies of different projects do not cross-contaminate each other. As a naive example, consider project A that requires torch==1.3.0
and project B that requires torch==2.0
, then doing
cd project_A # move to project A\npip install torch==1.3.0 # install old torch version\ncd ../project_B # move to project B\npip install torch==2.0 # install new torch version\ncd ../project_A # move back to project A\npython main.py # try executing main script from project A\n
will mean that even though we are executing the main script from project A's folder, it will use torch==2.0
instead of torch==1.3.0
because that is the last version we installed, because in both cases pip
will install the package into the same environment, in this case the global environment. Instead, if we did something like:
cd project_A # move to project A\npython -m venv env # create a virtual environment in project A\nsource env/bin/activate # activate that virtual environment\npip install torch==1.3.0 # install old torch version into the virtual environment belonging to project A\ncd ../project_B # move to project B\npython -m venv env # create a virtual environment in project B\nsource env/bin/activate # activate that virtual environment\npip install torch==2.0 # install new torch version into the virtual environment belonging to project B\ncd ../project_A # move back to project A\nsource env/bin/activate # activate the virtual environment belonging to project A\npython main.py # succeed in executing main script from project A\n
cd project_A # move to project A\npython -m venv env # create a virtual environment in project A\n.\\env\\Scripts\\activate # activate that virtual environment\npip install torch==1.3.0 # install old torch version into the virtual environment belonging to project A\ncd ../project_B # move to project B\npython -m venv env # create a virtual environment in project B\n.\\env\\Scripts\\activate # activate that virtual environment\npip install torch==2.0 # install new torch version into the virtual environment belonging to project B\ncd ../project_A # move back to project A\n.\\env\\Scripts\\activate # activate the virtual environment belonging to project A\npython main.py # succeed in executing main script from project A\n
then we would be sure that torch==1.3.0
is used when executing main.py
in project A because we are using two different virtual environments. In the above case, we used the venv module which is the built-in Python module for creating virtual environments. venv+pip
is arguably a good combination but when working on multiple projects it can quickly become a hassle to manage all the different virtual environments yourself, remembering which Python version to use, which packages to install and so on.
For this reason, a number of package managers have been created that can help you manage your virtual environments and dependencies, with some of the most popular being:
with more being created every year (rye is looking like an interesting project). This is considered a problem in the Python community, because it means that there is no standard way of managing dependencies, unlike other languages that have npm
for node.js
or cargo
for rust
.
In the course, we do not care about which package manager you use, but we do care that you use one. If you are already familiar with one package manager, then skip this exercise and continue to use that. The best recommendation that I can give regarding package managers, in general, is to find one you like and then stick with it. A lot of time can be wasted on trying to find the perfect package manager, but in the end, they all do the same thing with some minor differences. Check out this blog post if you want a fairly up-to-date evaluation of the different environment management and packaging tools that exist in the Python ecosystem.
If you are not familiar with any package managers, then we recommend that you use conda
and pip
for this course. You probably already have conda installed on your laptop, which is great. What conda does especially well, is that it allows you to create virtual environments with different Python versions, which can be really useful if you encounter dependencies that have not been updated in a long time. In this course specifically, we are going to recommend the following workflow
conda
to create virtual environments with specific Python versions
pip
to install packages in that environment
Installing packages with pip
inside conda
environments has been considered a bad practice for a long time, but since conda>=4.6
it is considered safe to do so. The reason for this is that conda
now has a built-in compatibility layer that makes sure that pip
installed packages are compatible with the other packages installed in the environment.
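As a minimal sketch of this workflow (the environment name and Python version are just examples):
conda create -n my_environment python=3.11  # create an environment with a specific Python version\nconda activate my_environment                # activate the environment\npip install -r requirements.txt              # use pip to install packages inside that environment\n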
Before we get started with the exercises, let's first talk a bit about Python dependencies. One of the most common ways to specify dependencies in the Python community is through a requirements.txt
file, which is a simple text file that contains a list of all the packages that you want to install. The format allows you to specify the package name and version number you want, with 7 different operators:
package1 # any version\npackage2 == x.y.z # exact version\npackage3 >= x.y.z # at least version x.y.z\npackage4 > x.y.z # newer than version x.y.z\npackage5 <= x.y.z # at most version x.y.z\npackage6 < x.y.z # older than version x.y.z\npackage7 ~= x.y.z # compatible release: at least x.y.z but older than x.(y+1)\n
In general, all packages (should) follow the semantic versioning standard, which means that the version number is split into three parts: x.y.z
where x
is the major version, y
is the minor version and z
is the patch version.
The reason that we need to specify the version number is that we want to make sure that we can reproduce our code at a later point. If we do not specify the version number, then we are at the mercy of the package maintainer to not change the API of the package. This is especially important when working with machine learning models, as we want to make sure that we can reproduce the exact same model at a later point.
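For example, a fully pinned requirements.txt file (the packages and versions below are purely illustrative) could look like this:
torch==2.1.0\nnumpy==1.26.4\nmatplotlib==3.8.1\n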
Finally, we also need to discuss dependency resolution, which is the process of figuring out which packages are compatible. This is a non-trivial problem, and there exist a lot of different algorithms for doing this. If you have ever thought that pip
and conda
were taking a long time to install something, then it is probably because they were trying to figure out which packages are compatible with each other. For example, if you try to install
pip install \"matplotlib >= 3.8.0\" \"numpy <= 1.19\" --dry-run\n
then it would simply fail because there are no versions of matplotlib
and numpy
under the given constraints that are compatible with each other. In this case, we would need to relax the constraints to something like
pip install \"matplotlib >= 3.8.0\" \"numpy <= 1.21\" --dry-run\n
to make it work.
"},{"location":"s1_development_environment/package_manager/#exercises","title":"\u2754 Exercises","text":"For hints regarding how to use conda
you can check out the cheat sheet in the exercise folder.
Download and install conda
. You are free to either install full conda
or the much simpler version miniconda
. The core difference between the two packages is that conda
already comes with a lot of packages that you would normally have to install with miniconda
. The downside is that conda
is a much larger package which can be a huge disadvantage on smaller devices. Make sure that your installation is working by writing conda help
in a terminal and it should show you the help message for conda. If this does not work you probably need to set some system variable to point to the conda installation
If you have successfully installed conda, then you should be able to execute the conda
command in a terminal.
Conda will always tell you what environment you are currently in, indicated by the (env_name)
in the prompt. By default it will always start in the (base)
environment.
Try creating a new virtual environment. Make sure that it is called my_enviroment
and that it installs version 3.11 of Python. What command should you execute to do this?
We highly recommend that you use Python 3.8 or higher for this course. In general, we recommend that you use the second latest version of Python that is available (currently Python 3.11 as of writing this). This is because the latest version of Python is often not supported by all dependencies. You can always check the status of different Python version support here.
Which conda
command gives you a list of all the environments that you have created?
Which conda
command gives you a list of the packages installed in the current environment?
How do you easily export this list to a text file? Do this, and make sure you export it to a file called enviroment.yaml
, as conda by default uses a different format than pip
.
Inspect the file to see what is in it.
The enviroment.yaml
file you have created is one way to ensure reproducibility between users, because anyone should be able to get an exact copy of your environment if they have your enviroment.yaml
file. Try creating a new environment directly from your enviroment.yaml
file and check that the packages being installed exactly match what you originally had.
As the introduction states, it is fairly safe to use pip
inside conda
today. What is the corresponding pip
command that gives you a list of all pip
installed packages? And how do you export this to a requirements.txt
file?
If you look through the requirements that both pip
and conda
produce, then you will see that they are often filled with a lot more packages than you are actually using in your project. What you are really interested in are the packages that you import in your code: from package import module
. One way to get around this is to use the package pipreqs
, which will automatically scan your project and create a requirements file specific to that. Let's try it out:
Install pipreqs
:
pip install pipreqs\n
Either try out pipreqs
on one of your own projects or try it out on some other online project. What does the requirements.txt
file pipreqs
produces look like compared to the files produced by either pip
or conda
?
Try executing the command
pip install \"pytest < 4.6\" pytest-cov==2.12.1\n
based on the error message you get, what would be a compatible way to install these?
Solution: As pytest-cov==2.12.1
requires a version of pytest
newer than 4.6
, we can simply change the command to be:
pip install \"pytest >= 4.6\" pytest-cov==2.12.1\n
but other solutions of course exist as well.
This ends the module on setting up virtual environments. While the methods mentioned in the exercises are great ways to construct requirements files automatically, sometimes it is just easier to sit down and create the files manually, as that way you ensure that only the most necessary requirements are installed when creating a new environment.
"},{"location":"s2_organisation_and_version_control/","title":"Getting started with MLOps - Organization and version control","text":"Slides
Today we take our first steps into the world of MLOps. The set of modules in this session focuses on getting organized and making sure that you are familiar with good development practices. While many of the practices you will learn about in these modules do not seem that important when you are a single person working on a project, it is crucial when working in large groups that the differences in how different people organize and write their code are minimized. The topics in this session will focus on:
Some exercises in this course are very loosely stated (including the exercises today). You are expected to seek out information before you ask for help (Google is your friend!), as you will learn more from trying to solve the problems yourself, and it is closer to how the \"real world\" works.
Learning objectives
The learning objectives of this session are:
git
to track changes to your code
dvc
to version control data
Core Module
With a basic understanding of version control, it is now time to really begin filling up our code repository. However, the question then remains: how should we organize our code? As developers, we tend not to think about code organization that much. It is instead something that is created dynamically as we need it. However, maybe we should spend some time getting organized initially, with the chance of making our code easier to develop and maintain in the long run. If we do not spend time organizing our code, we may end up with a mess of code that is hard to understand or maintain.
Big ball of Mud
A Big Ball of Mud is a haphazardly structured, sprawling, sloppy, duct-tape-and-baling-wire, spaghetti-code jungle. These systems show unmistakable signs of unregulated growth, and repeated, expedient repair. Information is shared promiscuously among distant elements of the system, often to the point where nearly all the important information becomes global or duplicated. The overall structure of the system may never have been well defined. If it was, it may have eroded beyond recognition. Programmers with a shred of architectural sensibility shun these quagmires. Only those who are unconcerned about architecture, and, perhaps, are comfortable with the inertia of the day-to-day chore of patching the holes in these failing dikes, are content to work on such systems. Brian Foote and Joseph Yoder, Big Ball of Mud. Fourth Conference on Patterns Languages of Programs (PLoP '97/EuroPLoP '97) Monticello, Illinois, September 1997
We are here going to focus on the organization of data science and machine learning projects. The core difference this kind of project introduces compared to more traditional systems is data. The key to modern machine learning is without a doubt the vast amounts of data that we have access to today. It is therefore not unreasonable that data should influence our choice of code structure. If we had another kind of application, then the layout of our codebase should probably be different.
"},{"location":"s2_organisation_and_version_control/code_structure/#cookiecutter","title":"Cookiecutter","text":"We are in this course going to use the tool cookiecutter, which is tool for creating projects from project templates. A project template is in short just na overall structure of how you want your folders, files etc. to be organised from the beginning. For this course we are going to be using a custom MLOps template. The template is essentially a fork of the cookiecutter data science template template that has been used for a couple of years in the course, but specialized a bit more towards MLOps instead of general data science.
We are not going to argue that this template is better than every other template; we are just focusing on the fact that it is a standardized way of creating project structures for machine learning projects. By standardized we mean that if two persons are both using cookiecutter
with the same template, the layout of their code follows the same specific rules, enabling one to understand the other person's code faster. Code organization is therefore not only about making the code easier for you to maintain but also for others to read and understand.
Below is seen the default code structure of cookiecutter for data science projects.
What is important to keep in mind when using a template is that it is exactly that: a template. By definition, a template is a guide for making something. Therefore, not all parts of a template may be important for the project at hand. Your job is to pick the parts of the template that are useful for organizing your machine learning project and add the parts that are missing.
"},{"location":"s2_organisation_and_version_control/code_structure/#python-projects","title":"Python projects","text":"While the same template in principal could be used regardless of what language we were using for our machine learning or data science application, there are certain considerations to take into account based on what language we are using. Python is the dominant language for machine learning and data science currently, which is why we in this section are focusing on some of the special files you will need for your Python projects.
The first file you may or may not know is the __init__.py
file. In Python the __init__.py
file is used to mark a directory as a Python package. Therefore as a bare minimum, any Python package should look something like this:
\u251c\u2500\u2500 src/\n\u2502 \u251c\u2500\u2500 __init__.py\n\u2502 \u251c\u2500\u2500 file1.py\n\u2502 \u251c\u2500\u2500 file2.py\n\u251c\u2500\u2500 pyproject.toml\n
The second file to focus on is the pyproject.toml
. This file is important for actually converting your code into a Python project. Essentially, whenever you run pip install
, pip
is in charge of both downloading the package you want but also in charge of installing it. For pip
to be able to install a package it needs instructions on what part of the code it should install and how to install it. This is the job of the pyproject.toml
file.
Below we have added a description of both the structure of the pyproject.toml
file and of setup.py + setup.cfg
, which is the \"old\" way of providing project instructions for Python projects. However, you may still encounter a lot of projects using setup.py + setup.cfg
so it is good to at least know about them.
pyproject.toml
is the new standardized way of describing project metadata in a declarative way, introduced in PEP 621. It is written in the toml format, which is easy to read. At the very least your pyproject.toml
file should include the [build-system]
and [project]
sections:
[build-system]\nrequires = [\"setuptools\", \"wheel\"]\nbuild-backend = \"setuptools.build_meta\"\n\n[project]\nname = \"my-package-name\"\nversion = \"0.1.0\"\nauthors = [{name = \"EM\", email = \"me@em.com\"}]\ndescription = \"Something cool here.\"\nrequires-python = \">=3.8\"\ndynamic = [\"dependencies\"]\n\n[tool.setuptools.dynamic]\ndependencies = {file = [\"requirements.txt\"]}\n
the [build-system]
informs pip
/python
that to build this Python project it needs the two packages setuptools
and wheel
and that it should use the setuptools.build_meta backend to actually build the project. The [project]
section essentially contains metadata regarding the package, what it's called etc., in case we ever want to publish it to PyPI.
For specifying dependencies of your project you have two options. Either you specify them in a requirements.txt
file and include it as a dynamic field in pyproject.toml
as shown above. Alternatively, you can add a dependencies
field under the [project]
header like this:
[project]\ndependencies = [\n 'torch==2.1.0',\n 'matplotlib>=3.8.1'\n]\n
The improvement over setup.py + setup.cfg
is that pyproject.toml
also allows for metadata from other tools to be specified in it, essentially making sure you only need a single file for your project. For example, in the next [module M7 on good coding practices] you will learn about the tool ruff
and how it can help format your code. If we want to configure ruff
for our project we can do that directly in pyproject.toml
by adding additional headers:
[tool.ruff]\nruff_option = ...\n
To read more about how to specify pyproject.toml
this page is a good place to start.
setup.py
is the original way of describing how a Python package should be built. The most basic setup.py
file will look like this:
from setuptools import setup\n\n# read the requirements file to get the list of dependencies\nwith open(\"requirements.txt\") as f:\n    requirements = [line.strip() for line in f if line.strip() and not line.startswith(\"#\")]\n\nsetup(\n    name=\"my-package-name\",\n    version=\"0.1.0\",\n    author=\"EM\",\n    description=\"Something cool here.\",\n    install_requires=requirements,\n)\n
Essentially, it is the exact same meta information as in pyproject.toml
, just written directly in Python syntax instead of toml
. Because there was a wish to separate this meta information into a separate file, the setup.cfg
file was created which can contain the exact same information as setup.py
just in a declarative config.
[metadata]\nname = my-package-name\nversion = 0.1.0\nauthor = EM\ndescription = \"Something cool here.\"\n# ...\n
This non-standardized way of providing meta information regarding a package was essentially what led to the creation of pyproject.toml
.
Regardless of which way a project is configured, after creating the above files the correct way to install the project is the same:
pip install .\n# or in developer mode\npip install -e . # (1)!\n
-e
is short for --editable
mode, also called developer mode. Since we will continuously be iterating on our package, this is the preferred way to install it, because it means that we do not have to run pip install
every time we make a change. Essentially, in developer mode, changes in the Python source code take effect immediately without requiring a new installation.
After running this, your code should be available to import as from project_name import ...
like any other Python package you use. This is the most essential knowledge you need about creating Python packages.
After having installed cookiecutter (exercises 1 and 2), the remaining exercises are intended to take the simple CNN MNIST classifier from yesterday's exercise and force it into this structure. You are not required to fill out every folder and file in the project structure, but try to at least follow the steps in the exercises. Whenever you need to run a file, I recommend always doing this from the root directory, e.g.
python <project_name>/data/make_dataset.py data/raw data/processed\npython <project_name>/models/train_model.py <arguments>\netc...\n
in this way paths (for saving and loading files) are always relative to the root.
Install cookiecutter framework
pip install cookiecutter\n
Start a new project using this template, that is specialized for this course (1).
You do this by running the cookiecutter command using the template url:
cookiecutter <url-to-template>\n
Valid project names
When asked for a project name you should follow the PEP8 guidelines for naming packages. This means that the name should be all lowercase and if you want to separate words, you should use underscores. For example my_project
is a valid name, while MyProject
is not. Additionally, the package name cannot start with a number.
There are two common choices for how to lay out your source directory. The first is called src-layout, where the source code is always placed in a src/<project_name>
folder, and the second is called flat-layout, where the source code is just placed in a <project_name>
folder. The template we are using in this course uses the flat-layout, but there are pros and cons for both.
After having created your new project, the first step is to also create a corresponding virtual environment and install any needed requirements. If you have a virtual environment from yesterday, feel free to use that, otherwise create a new one. Then install the project in that environment
pip install -e .\n
Start by filling out the <project_name>/data/make_dataset.py
file. When this file runs, it should take the raw data e.g. the corrupted MNIST files from yesterday (../data/corruptmnist
) which now should be located in a data/raw
folder and process them into a single tensor, normalize the tensor and save this intermediate representation to the data/processed
folder. By normalization here we refer to making sure the images have mean 0 and standard deviation 1.
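A minimal sketch of what such a script could look like is shown below. Note that the raw file name patterns are assumptions, so adjust them to match your own data:
from pathlib import Path\n\nimport torch\n\n\ndef make_dataset(raw_dir: str = \"data/raw\", processed_dir: str = \"data/processed\") -> None:\n    \"\"\"Load the raw corrupted MNIST files, normalize them and save a processed version.\"\"\"\n    raw_path, processed_path = Path(raw_dir), Path(processed_dir)\n    processed_path.mkdir(parents=True, exist_ok=True)\n\n    # stack all raw image and target files into single tensors (file patterns are assumptions)\n    images = torch.cat([torch.load(f) for f in sorted(raw_path.glob(\"train_images_*.pt\"))])\n    targets = torch.cat([torch.load(f) for f in sorted(raw_path.glob(\"train_target_*.pt\"))])\n\n    # convert to float and normalize to mean 0 and standard deviation 1\n    images = images.float()\n    images = (images - images.mean()) / images.std()\n\n    torch.save(images, processed_path / \"train_images.pt\")\n    torch.save(targets, processed_path / \"train_target.pt\")\n\n\nif __name__ == \"__main__\":\n    make_dataset()\n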
This template comes with a Makefile
that can be used to easily define common operations in a project. You do not have to understand the complete file but try taking a look at it. In particular the following commands may come in handy
make data # runs the make_dataset.py file, try it!\nmake clean # clean __pycache__ files\nmake requirements # install everything in the requirements.txt file\n
Windows users make
is a GNU build tool that is not available on Windows by default. There are two recommended ways to get it running on Windows. The first is leveraging the Windows Subsystem for Linux, which you maybe have already installed. The second option is utilizing the chocolatey package manager, which enables Windows users to install packages similarly to a Linux system.
In general we recommend that you add commands to the Makefile
as you move along in the course. If you want to know more about how to write Makefile
s then this is an excellent video.
Put your model file (model.py
) into <project_name>/models
folder and insert the relevant code from the main.py
file into the train_model.py
file. Make sure that whenever a model is trained and saved, it gets saved to the models
folder (preferably in sub-folders).
When you run train_model.py
, make sure that some statistics/visualizations from the trained models get saved to the reports/figures/
folder. This could be a simple .png
of the training curve.
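For example, assuming you collect the recorded losses in a list called train_losses during training, a simple way to produce such a figure could be:
import matplotlib.pyplot as plt\n\ntrain_losses = [2.3, 1.8, 1.2, 0.9, 0.7]  # replace with the losses you record during training\n\nplt.figure()\nplt.plot(train_losses)\nplt.xlabel(\"Training step\")\nplt.ylabel(\"Loss\")\nplt.savefig(\"reports/figures/training_curve.png\")\n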
(Optional) Can you figure out a way to add a train
command to the Makefile
such that training can be started using
make train\n
Fill out the newly created <project_name>/models/predict_model.py
file, such that it takes a pre-trained model file and creates predictions for some data. The recommended interface is that users can give this file either a folder with raw images that get loaded in, or a numpy
or pickle
file with already loaded images e.g. something like this
python <project_name>/models/predict_model.py \\\n models/my_trained_model.pt \\ # file containing a pretrained model\n data/example_images.npy # file containing just 10 images for prediction\n
Fill out the file <project_name>/visualization/visualize.py
with this (as a minimum, feel free to add more visualizations)
reports/figures/
folder.
(Optional) Feel free to create more files/visualizations (what about investigating/exploring the data distribution?)
Make sure to update the README.md
file with a short description on how your scripts should be run
Finally make sure to update the requirements.txt
file with any packages that are necessary for running your code (see this set of exercises for help)
(Optional) Let's say that you are not satisfied with the template I have recommended that you use, which is completely fine. What should you then do? You should of course create your own template! This is actually not that hard to do.
As a starting point, I would recommend that you fork either the mlops template, which you have already been using, or alternatively fork the data science template.
After forking the template, clone it down locally and let's start modifying it. The first step is changing the cookiecutter.json
file. For the mlops template it looks like this:
{\n \"project_name\": \"project_name\",\n \"repo_name\": \"{{ cookiecutter.project_name.lower().replace(' ', '_') }}\",\n \"author_name\": \"Your name (or your organization/company/team)\",\n \"description\": \"A short description of the project.\",\n \"python_version_number\": \"3.10\",\n \"open_source_license\": [\"No license file\", \"MIT\", \"BSD-3-Clause\"]\n}\n
simply add a new line to the json file with the name of the variable you want to add and the default value you want it to have.
The actual template is located in the {{ cookiecutter.project_name }}
folder. cookiecutter
works by replacing everywhere that it sees {{ cookiecutter.<variable_name> }}
with the value of the variable. Therefore, if you want to add a new file to the template, just add it to the {{ cookiecutter.project_name }}
folder and make sure to add the {{ cookiecutter.<variable_name> }}
where you want the variable to be replaced.
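As a small, hypothetical example, a README.md placed inside the template folder could contain:
# {{ cookiecutter.project_name }}\n\n{{ cookiecutter.description }}\n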
After you have made the changes you want to the template, you should test it locally. Just run
cookiecutter . -f --no-input\n
and it should create a new folder using the default values of the cookiecutter.json
file.
Finally, make sure to push any changes you made to the template to GitHub, such that you in the future can use it by simply running
cookiecutter https://github.com/<username>/<my_template_repo>\n
Starting completely from scratch, what are the steps needed to create a new GitHub repository and push a specific template to it as the very first commit?
Solution: Create a completely barebones repository, either using the GitHub UI or, if you have the GitHub CLI installed (not git
) you can run
gh repo create <repo_name> --public --confirm\n
Run cookiecutter
with the template you want to use
cookiecutter <template>\n
The name of the folder created by cookiecutter
should be the same as you just used.
Run the following sequence of commands
cd <project_name>\ngit init\ngit add .\ngit commit -m \"Initial commit\"\ngit remote add origin https://github.com/<username>/<repo_name>\ngit push origin master\n
That's it. The template should now have been pushed to the repository as the first commit.
That ends the module on code structure and cookiecutter
. We again want to stress that the point of using cookiecutter
is not about following one specific template, but instead just about using any template for organizing your code. What often happens in a team is that multiple templates are needed in different stages of the development phase or for different product types, because they share a common structure while still having some specifics. Keeping templates up-to-date then becomes critical such that no team member is using an outdated template. If you ever end up in this situation, we highly recommend checking out cruft, which works alongside cookiecutter
to not only create projects but also update existing ones as the template evolves. Cruft additionally has template validation capabilities to ensure projects match the latest version of a template.
Core Module
In this module, we are going to return to version control. However, this time we are going to focus on version control of data. The reason we need to separate between standard version control and data version control comes down to one problem: size.
Classic version control was developed to keep track of code files, which are all simple text files. Even a codebase that contains 1000+ files with millions of lines of code can probably be stored in less than a single gigabyte (GB). On the other hand, the size of data can be drastically bigger. As most machine learning algorithms only get better the more data you feed them, we are seeing models today that are being trained on petabytes of data (1,000,000 GB).
Because this is an important concept, there exist a couple of frameworks that have specialized in versioning data, such as DVC, DAGsHub, Hub, Modelstore and ModelDB. Regardless of the framework, they all implement somewhat the same concept: instead of storing the actual data files, or in general any large artifact files, we instead store a pointer to these large files. We then version control the pointer instead of the artifact.
Image credit
We are in this course going to use DVC
provided by iterative.ai, as they also provide tools for automating machine learning, which we are going to focus on later.
DVC (Data Version Control) is simply an extension of git
to not only version data but also models and experiments in general. But how does it deal with these large data files? Essentially, DVC
will just keep track of a small metafile that will then point to some remote location where your original data is stored. Metafiles essentially work as placeholders for your data files. Your large data files are then stored in some remote location such as Google Drive or an S3
bucket from Amazon.
Image credit
As the figure shows, we now have two remote locations: one for code and one for data. We use git pull/push
for the code and dvc pull/push
for the data. The key concept is the connection between the data file model.pkl
which is fairly large and its respective metafile model.pkl.dvc
which is very small. The large file is stored in the data remote and the metafile is stored in the code remote.
If in doubt about some of the exercises, we recommend checking out the documentation for DVC as it contains excellent tutorials.
For these exercises, we are going to use Google drive as a remote storage solution for our data. If you do not already have a Google account, please create one (we are going to use it again in later exercises). Please make sure that you at least have 1GB of free space.
Next, install DVC and the Google Drive extension
pip install dvc\npip install \"dvc[gdrive]\"\n
If you installed DVC via pip and plan to use cloud services as remote storage, you might need to install these optional dependencies: [s3], [azure], [gdrive], [gs], [oss], [ssh]. Alternatively, use [all] to include them all. If you encounter that the installation fails, we recommend that you start by updating pip and then trying to update dvc
:
pip install -U pip\npip install -U \"dvc[gdrive]\"\n
If this does not work for you, it is most likely due to a problem with pygit2
and in that case we recommend that you follow the instructions here.
In your MNIST repository run the following command from the terminal
dvc init\n
this will setup dvc
for this repository (similar to how git init
will initialize a git repository). These files should be committed using standard git
to your repository.
Go to your Google Drive and create a new folder called dtu_mlops_data
. Then copy the unique identifier belonging to that folder as shown in the figure below
Using this identifier, add it as a remote storage
dvc remote add -d storage gdrive://<your_identifier>\n
Check the content of the file .dvc/config
. Does it contain a pointer to your remote storage? Afterwards, make sure to add this file to the next commit we are going to make:
git add .dvc/config\n
Call the dvc add
command on your data files exactly like you would add a file with git
(you do not need to add every file by itself as you can directly add the data/
folder). Doing this should create a human-readable file with the extension .dvc
. This is the metafile as explained earlier that will serve as a placeholder for your data. If you are on Windows and this step fails you may need to install pywin32
. At the same time, the data
folder should have been added to the .gitignore
file that marks which files should not be tracked by git. Confirm that this is correct.
Now we are going to add, commit and tag the metafiles so we can restore to this stage later on. Commit and tag the files, which should look something like this:
git add data.dvc .gitignore\ngit commit -m \"First datasets, containing 25000 images\"\ngit tag -a \"v1.0\" -m \"data v1.0\"\n
Finally, push your data to the remote storage using dvc push
. You will be asked to authenticate, which involves copy-pasting the code in the link prompted. Check out your Google Drive folder. You will see that the data is not in a recognizable format anymore due to the way that dvc
packs and tracks the data. The boring detail is that dvc
converts the data into content-addressable storage which makes data much faster to get. Finally, make sure that your data is not stored in your Github repository.
After authenticating the first time, DVC
should be setup without having to authenticate again. If you for some reason encounter that dvc fails to authenticate, you can try to reset the authentication. Locate the file $CACHE_HOME/pydrive2fs/{gdrive_client_id}/default.json
where $CACHE_HOME
depends on your operating system:
~/Library/Caches
~/.cache
This is the typical location, but it may vary depending on what distro you are running
{user}/AppData/Local
Delete the complete {gdrive_client_id}
folder and retry authenticating with dvc push
.
After completing the above steps, it is very easy for others (or yourself) to get setup with both code and data by simply running
git clone <my_repository>\ncd <my_repository>\ndvc pull\n
(assuming that you give them access rights to the folder in your drive). Try doing this (in some other location than your standard code) to make sure that the two commands indeed download both your code and data.
Let's look at the process of updating our data. Remember, the important aspect of version control is that we do not need to store explicit files called data_v1.pt
, data_v2.pt
etc. but just have a single data.pt
for which we can always check out earlier versions. Start by copying the data/corruptmnist_v2
folder from this repository to your MNIST code. This contains 3 extra datafiles with 15000 additional observations. Rerun your data pipeline so these get incorporated into the files in your processed
folder.
Redo the above steps, adding the new data using dvc
, committing and tagging the metafiles e.g. the following commands should be executed (with appropriate input):
dvc add -> git add -> git commit -> git tag -> dvc push -> git push
.
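As a sketch, that sequence could look like this (the commit message and tag are just examples):
dvc add data/\ngit add data.dvc\ngit commit -m \"Second dataset, containing 40000 images\"\ngit tag -a \"v2.0\" -m \"data v2.0\"\ndvc push\ngit push\n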
Let's say that you wanted to go back to the state of your data in v1.0. If the above steps have been done correctly, you should be able to do this using:
git checkout v1.0\ndvc checkout\n
confirm that you have reverted to the original data.
(Optional) Finally, it is important to note that dvc
is not only intended to be used to store data files but also any other large files such as trained model weights (with billions of parameters these can be quite large). For example, if we always store our best-performing model in a file called best_model.ckpt
then we can use dvc
to version control it, store it online and make it easy for others to download. Feel free to experiment with this using your model checkpoints.
In general dvc
is a great framework for version-controlling data and models. However, it is important to note that it does have some performance issues when dealing with datasets that consist of many files. Therefore, if you are ever working with a dataset that consists of many small files, it can be a good idea to:
zip files into a single archive and then version control the archive. The zip
archive should be placed in a data/raw
folder and then unzipped in the data/processed
folder.
If possible turn your data into 1D arrays, then it can be stored in a single file such as .parquet
or .csv
. This is especially useful for tabular data. Then you can version control the single file instead of the many files.
How do you know that a repository is using dvc?
Solution: Similar to a git repository having a .git
directory, a repository using dvc needs to have a .dvc
folder. Alternatively, you can use the dvc status
command.
Assume you just added a folder called data/
that you want to track with dvc
. What is the sequence of 5 commands to successfully version control the folder? (assuming you have already set up a remote)
dvc add data/\ngit add .\ngit commit -m \"added raw data\"\ngit push\ndvc push\n
That's all for today. With the combined power of git
and dvc
we should be able to version control everything in our development pipeline such that no changes are lost (assuming we commit regularly). It should be noted that dvc
offers more than just data version control, so if you want to deep dive into dvc
we recommend their pipeline feature and how this can be used to setup version controlled experiments. Note that we are going to revisit dvc
later for a more permanent (and large-scale) storage solution.
Core Module
Proper collaboration with other people will require that you can work on the same codebase in an organized manner. This is the reason that version control exists. Simply stated, it is a way to keep track of:
For a full explanation please see this page
Secondly, it is important to note that GitHub is not git! GitHub is the dominating player when it comes to hosting repositories, but that does not mean that they are the only one providing free repository hosting (see bitbucket or gitlab for some other examples).
That said, we will be using git and GitHub throughout this course. It is a requirement for passing this course that you create a public repository with your code and use git to upload any code changes. How much you choose to integrate this into your own projects is up to you, but you are at least expected to be familiar with git+GitHub.
Image credit"},{"location":"s2_organisation_and_version_control/git/#initial-config","title":"Initial config","text":"What does Git stand for?
The name \"git\" was given by Linus Torvalds when he wrote the very first version. He described the tool as \"the stupid content tracker\" and the name as (depending on your mood):
Install git on your computer and make sure that your installation is working by writing git help
in a terminal and it should show you the help message for git.
Create a GitHub account if you do not already have one.
To make sure that we do not have to type in our GitHub username every time that we want to do some changes, we can once and for all set them on our local machine
# type in a terminal\ngit config credential.helper store\ngit config --global user.email <email>\n
The most simple way to think of version control, is that it is just nodes with lines connecting them
Each node, which we call a commit is uniquely identified by a hash string. Each node, stores what our code looked like at that point in time (when we made the commit) and using the hash codes we can easily revert to a specific point in time.
The commits are made up of local changes that we make to our code. A basic workflow for adding commits are seen below
Assuming that we have made some changes to our local working directory and that we want to get these updates to be online in the remote repository we have to do the following steps:
First we run the command git add
. This will move our changes to the staging area. While changes are in the staging area we can very easily revert them (using git restore
). There have therefore not been assigned a unique hash to the code yet, and we can therefore still overwrite it.
To take our code from the staging area and make it into a commit, we simply run git commit
which will locally add a note to the graph. It is important again, that we have not pushed the commit to the online repository yet.
Finally, we want others to be able to use the changes that we made. We do a simple git push
and our commit gets online
Of course, the real power of version control is the ability to make branches, as in the image below
Image creditEach branch can contain code that are not present on other branches. This is useful when you are many developers working together on the same project.
"},{"location":"s2_organisation_and_version_control/git/#exercises","title":"\u2754 Exercises","text":"In your GitHub account create an repository, where the intention is that you upload the code from the final exercise from yesterday
After creating the repository, clone it to your computer
git clone https://github.com/my_user_name/my_repository_name.git\n
Move/copy the three files from yesterday into the repository (and any other that you made)
Add the files to a commit by using git add
command (1)
Commit the files using git commit
Finally push the files to your repository using git push
. Make sure to check online that the files have been updated in your repository.
You can always use the command git status
to check where you are in the process of making a commit.
Also checkout the git log
command, which will show you the history of commits that you have made.
Make sure that you understand how to make branches, as this will allow you to try out code changes without messing with your working code. Creating a new branch can be done using:
# create a new branch\ngit checkout -b <my_branch_name>\n
Afterwards, you can use git checkout
to change between branches (remember to commit your work!) Try adding something (a file, a new line of code etc.) to the newly created branch, commit it and try changing back to master afterwards. You should hopefully see whatever you added on the branch is not present on the main branch.
If you do not already have a cloned version of this repository belonging to the course, make sure to make one! I am continuously updating/changing some of the material during the course and I therefore recommend that you each day before the lecture do a git pull
on your local copy
Git may seem like a waste of time when solutions like dropbox, google drive etc exist, and it is not completely untrue when you are only one or two working on a project. However, these file management systems falls short when hundreds to thousands of people work together. For this exercise you will go through the steps of sending an open-source contribution:
Go online and find a project you do not own, where you can improve the code. You can either look at this page of good issues to get started with or for simplicity you can just choose the repository belonging to the course. Now fork the project by clicking the Fork button.
This will create a local copy of the repository which you have complete writing access to. Note that code updates to the original repository does not update code in your local repository.
Clone your local fork of the project using git clone
.
As default your local repository will be on the main branch
(HINT: you can check this with the git status
command). It is good practice to make a new branch when working on some changes. Use the git branch
command followed by the git checkout
command to create a new branch.
You are now ready to make changes to the repository. Try to find something to improve (any spelling mistakes?). When you have made the changes, do the standard git cycle: add -> commit -> push
Go online to the original repository and go to the Pull requests
tab. Find compare
button and choose the button to compare the master branch
of the original repo with the branch that you just created in your own repository. Check the diff on the page to make sure that it contains the changes you have made.
Write a bit about the changes you have made and click Create pull request
:)
Forking a repository has the consequence that your fork and the repository that you forked can diverge. To mitigate this we can set what is called a remote upstream. Take a look at this page, and set a remote upstream for the repository you just forked.
After setting the upstream branch, we need to pull and merge any updates. Take a look at this page and figure out how to do this.
As a final exercise we want to simulate a merge conflict, which happens when two users try to commit changes to exactly the same lines of code in the codebase, and git is not able to resolve how the different commits should be integrated.
In your browser, open your favorite repository (it could be the one you just worked on), go to any file of your choosing and click the edit button (see image below) and make some change to the file. For example, if you choose a python file you can just import some random packages at the top of the file. Commit the change.
Make sure not to pull the change you just made to your local computer. Locally make changes to the same file in the same lines and commit them afterwards.
Now try to git pull
the online changes. What should (hopefully) happen is that git will tell you that it found a merge conflict that needs to be resolved. Open the file and you should see something like this
<<<<<<< HEAD\nthis is some content to mess with\ncontent to append\n=======\ntotally different content to merge later\n>>>>>>> master\n
this should be interpreted as: everything that's between <<<<<<<
and =======
are the changes made by your local commit and everything between =======
and >>>>>>>
are the changes you are trying to pull. To fix the merge conflict you simply have to make the code in the two \"cells\" work together. When you are done, remove the identifiers <<<<<<<
, =======
and >>>>>>>
.
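For example, one way to resolve the conflict above is to simply keep both pieces of content, so that the file ends up looking like this:
this is some content to mess with\ncontent to append\ntotally different content to merge later\n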
Finally, commit the merge and try to push.
(Optional) The above exercises have focused on how to use git from the terminal, which I highly recommend learning. However, if you are using a proper editor, it will also have built-in support for version control. We recommend getting familiar with these features (here is a tutorial for VS Code)
How do you know if a certain directory is a git repository?
Solution: You can check if there is a \".git\" directory. Alternatively, you can use the git status
command.
Explain what the file gitignore
is used for?
The file gitignore
is used to tell git which files to ignore when doing a git add .
command. This is useful for files that are not part of the codebase, but are needed for the code to run (e.g. data files) or files that contain sensitive information (e.g. .env
files that contain API keys and passwords).
You have two branches - main and devel. What sequence of commands would you need to execute to make sure that devel is in sync with main?
Solution:
git checkout main\ngit pull\ngit checkout devel\ngit merge main\n
What best practices are you familiar with regarding version control?
Solution
That covers the basics of git to get you started. In the exercise folder you can find a git cheat sheet with the most useful commands for future reference. Finally, we want to point out another awesome feature of GitHub: the in-browser editor. Sometimes you have a small edit that you want to make but would still like to do it in an IDE/editor. Or you may be in a situation where you are working from another device than your usual developer machine. GitHub has a built-in editor that can simply be enabled by changing any URL from
https://github.com/username/repository\n
to
https://github.dev/username/repository\n
Try it out on your newly created repository.
"},{"location":"s2_organisation_and_version_control/good_coding_practice/","title":"M7 - Good coding practice","text":""},{"location":"s2_organisation_and_version_control/good_coding_practice/#good-coding-practice","title":"Good coding practice","text":"Quote
Code is read more often than it is written. Guido Van Rossum (author of Python)
It is hard to define exactly what good coding practices are. But the above quote by Guido does hint at what it could be, namely that it has to do with how others observe and perceive your code. In general, good coding practice is about making sure that your code is easy to read and understand, not only by others but also by your future self. The key concept to keep in mind when we are talking about good coding practice is consistency. In many cases it does not matter exactly how you choose to style your code etc., the important part is that you are consistent about it.
Image credit"},{"location":"s2_organisation_and_version_control/good_coding_practice/#documentation","title":"Documentation","text":"Most programmers have a love-hate relationship with documentation: We absolute hate writing it ourself, but love when someone else has actually taken time to add it to their code. There is no doubt about that well documented code is much easier to maintain, as you do not need to remember all details about the code to still maintain it. It is key to remember that good documentation saves more time, than it takes to write.
The problem with documentation is that there is no right or wrong way to do it. You can end up doing:
Under documentation: You document information that is clearly visible from the code and not the complex parts that are actually hard to understand.
Over documentation: Writing too much documentation will have the opposite effect on most people than what you want: there is too much to read, so people will skip it.
Writing good documentation is a skill that takes time to train, so let's try to do it.
Quote
Code tells you how; Comments tell you why. Jeff Atwood
"},{"location":"s2_organisation_and_version_control/good_coding_practice/#exercises","title":"\u2754 Exercises","text":"Go over the most complicated file in your project. Be critical and add comments where the logic behind the code is not easily understandable. (1)
In deep learning we often work with multi-dimensional tensors that constantly change shape after each operation. It is good practice to annotate with comments when tensors undergo some reshaping. In the following example we compute the pairwise euclidean distance between two tensors using broadcasting, which results in multiple shape operations.
x = torch.randn(5, 10) # N x D\ny = torch.randn(7, 10) # M x D\nxy = x.unsqueeze(1) - y.unsqueeze(0) # (N x 1 x D) - (1 x M x D) = (N x M x D)\npairwise_euc_dist = xy.abs().pow(2.0).sum(dim=-1) # N x M\n
Add docstrings to at least two python functions/methods. You can see here (example 5) a good example of how to use identifiable keywords such as Parameters
, Args
, Returns
which standardizes the way of writing docstrings.
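For illustration, a minimal docstring using these keywords (the function and argument names are just an example) could look like:
import torch\n\n\ndef normalize(images: torch.Tensor) -> torch.Tensor:\n    \"\"\"Normalize a batch of images to mean 0 and standard deviation 1.\n\n    Args:\n        images: Tensor of shape (N, H, W) containing the images.\n\n    Returns:\n        Tensor of the same shape with normalized values.\n    \"\"\"\n    return (images - images.mean()) / images.std()\n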
While python already enforces some styling (e.g. code should be indented in a specific way), this is not enough to ensure that code from different users actually looks alike. Maybe even more troubling is that you will often see that your own style of coding changes as you become more and more experienced. This kind of difference in coding style is not that important to take care of when you are working on a personal project, but when multiple people are working together on the same project it is important to consider.
The question then remains what styling you should use. This is where PEP8 comes into play, which is the official style guide for python. It essentially contains what is considered \"good practice\" and \"bad practice\" when coding python.
For many years the most commonly used tool to check if your code is PEP8 compliant was flake8. However, we are in this course going to be using ruff, which is quickly gaining popularity due to how fast it is and how quickly the developers are adding new features. (1)
flake8
and ruff
are what are called linters or lint tools, which are static code analysis programs used to flag programming errors, bugs, and styling errors.
Install ruff
pip install ruff\n
Run ruff
on your project or part of your project
ruff check . # Lint all files in the current directory (and any subdirectories)\nruff check path/to/code/ # Lint all files in `/path/to/code` (and any subdirectories).\n
are you PEP8 compliant or are you a normal mortal?
You could go and fix all the small errors that ruff
is giving. However, in practice large projects instead rely on some kind of code formatter that will automatically format your code for you to be PEP8 compliant. Some of the biggest formatters in Python have for the longest time been black and yapf, but we are going to use ruff
which also has a built-in formatter that should be a drop-in replacement for black
.
Try to use ruff format
to format your code
ruff format . # Format all files in the current directory.\nruff format /path/to/file.py # Format a single file.\n
By default ruff
will apply a selection of rules when we are either checking it or formatting it. However, many more rules can be activated through configuration. If you have completed module M6 on code structure you will have encountered the pyproject.toml
file, which can store both build instructions for our package and configuration of developer tools. Let's try to configure ruff
using the pyproject.toml
file.
One aspect that is not covered by PEP8 is how import
statements in Python should be organized. If you are like most people, you place your import
statements at the top of the file and they are ordered simply by when you needed them. A better practice is to introduce some clear structure in our imports. In older versions of this course we have used isort to do the job, but we are here going to configure ruff
to do the job. In your pyproject.toml
file add the following lines
[tool.ruff]\nselect = [\"I\"]\n
and try re-running ruff check
and ruff format
. Hopefully this should reorganize your imports to follow common practice. (1)
Standard library imports (like os
) in one block, followed by third-party dependencies (like torch
) in a second block and finally imports from your own package in a third block. Each block is then put in alphabetical order.One PEP8 styling rule that is often diverged from is the recommended line length of 79 characters, which by many (including myself) is considered very restrictive. If you code consist of multiple levels of indentation, you can quikly run into 79 characters being limiting. For this reason many projects increase it, often to 120 characters which seems to be the sweet spot of how many characters fits in a coding window on a laptop. Add the line
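As a small illustration of this convention (my_project and MyAwesomeModel are placeholders for your own package and code), a correctly ordered import block could look like:
import os\nfrom pathlib import Path\n\nimport numpy as np\nimport torch\n\nfrom my_project.models import MyAwesomeModel\n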
line-length=120\n
under the [tool.ruff]
section in the pyproject.toml
file and rerun ruff check
and ruff format
on your code.
Experiment yourself with further configuration of ruff
. In particular we recommend adding more rules and looking [tool.ruff.pydocstyle]
configuration to indicate how you have styled your documentation.
In addition to writing documentation and following a specific styling, in python we have a third way of improving the quality of our code: through typing. Typing goes back to the earlier programming languages like c
, c++
etc. where data types needed to be explicit stated for variables:
int main() {\n int x = 5 + 6;\n float y = 0.5;\n cout << \"Hello World! \" << x << std::endl();\n}\n
This is not required by python but it can really improve the readability of code, that you can directly read from the code what the expected types of input arguments and returns are. In python the :
character have been reserved for type hints. Here is one example of adding typing to a function:
def add2(x: int, y: int) -> int:\n return x+y\n
here we mark that both x
and y
are integers and using the arrow notation ->
we mark that the output type is also an integer. Assuming that we are also going to use the function for floats and torch.Tensor
s we could improve the typing by specifying a union of types. Depending on the version of python you are using the syntax for this can be different.
from torch import Tensor # note it is Tensor with upper case T. This is the base class of all tensors\nfrom typing import Union\ndef add2(x: Union[int, float, Tensor], y: Union[int, float, Tensor]) -> Union[int, float, Tensor]:\n return x+y\n
from torch import Tensor # note it is Tensor with upper case T. This is the base class of all tensors\ndef add2(x: int | float | Tensor, y: int | float | Tensor) -> int | float | Tensor:\n return x+y\n
Finally, since this is a very generic function it also works on numpy
arrays etc. we can always default to the Any
type if we are not sure about all the specific types that a function can take
from typing import Any\ndef add2(x: Any, y: Any) -> Any:\n return x+y\n
However, in this case we are basically in the same situation as if our function were not typed, as the type hints do not help us at all. Therefore, use Any
only when necessary.
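If Any feels too loose, a constrained type variable can tie the output type to the input types. This is only a sketch (not part of the exercise files), and it is stricter than the union version above, since both arguments must then be of the same type:
from typing import TypeVar\n\nfrom torch import Tensor\n\nT = TypeVar(\"T\", int, float, Tensor)  # T must be exactly one of these three types\n\ndef add2(x: T, y: T) -> T:\n    # the return type is known to match the argument type\n    return x + y\n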
Exercise files
We provide a file called typing_exercise.py
. Add typing everywhere in the file. Please note that you will need the following import:
from typing import Callable, Optional, Tuple, Union, List # you will need all of them in your code\n
for it to work. This cheat sheet is a good resource on typing. We also provide typing_exercise_solution.py
, but try to solve the exercise yourself.
mypy is what is called a static type checker. If you are using typing in your code, then a static type checker can help you find common mistakes. mypy
does not run your code, but it scans it and checks that the types you have given are compatible. Install mypy
pip install mypy\n
Try to run mypy
on the typing_exercise.py
file
mypy typing_exercise.py\n
If you have solved exercise 11 correctly then you should get no errors. If not mypy
should tell you where your types are incompatible.
According to PEP8 what is wrong with the following code?
class myclass(nn.Module):\n def TrainNetwork(self, X, y):\n ...\n
Solution
According to PEP8, classes should follow the CapWords convention, meaning that the first letter in each word of the class name should be capitalized. Thus myclass
should be MyClass
. On the other hand, functions and methods should be fully lowercase with words separated by underscores. Thus TrainNetwork
should be train_network
. A PEP8-compliant version is sketched below.
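For reference, a version of the snippet that follows both naming conventions could look like the sketch below (assuming nn refers to torch.nn as in the original snippet):
from torch import nn\n\nclass MyClass(nn.Module):\n    def train_network(self, x, y):\n        ...\n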
What would be the type of argument x
for a function def f(x):
if it should support the following inputs?
x1 = [1, 2, 3, 4]\nx2 = (1, 2, 3, 4)\nx3 = None\nx4 = {1: \"1\", 2: \"2\", 3: \"3\", 4: \"4\"}\n
Solution The easy solution would be to do def f(x : Any)
. But instead we could also go with:
from typing import Dict, List, Tuple\n\ndef f(x: None | Tuple[int, ...] | List[int] | Dict[int, str]):\n
alternatively, we could also do
def f(x: None | Iterable[int]):\n
because list
, tuple
and dict
are all iterables and can therefore be covered by a single type (in this specific case). A quick check of this is sketched below.
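As a quick sanity check of that annotation (a sketch, assuming mypy is installed), all of the calls below run, while the last one is the kind of call mypy would flag:
from typing import Iterable\n\ndef f(x: None | Iterable[int]) -> None:\n    if x is not None:\n        for item in x:\n            print(item)\n\nf([1, 2, 3, 4])      # a list of ints\nf((1, 2, 3, 4))      # a tuple of ints\nf(None)              # explicitly allowed\nf({1: \"1\", 2: \"2\"})  # a dict iterates over its (int) keys\nf([\"a\", \"b\"])        # runs, but mypy would flag it: list[str] is not Iterable[int]\n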
This ends the module on coding style. We again want to emphasize that a good coding style is more about having a consistent style than strictly following PEP8. A good example of this is Google, which has its own style guide for Python. This guide does not match PEP8 exactly, but it makes sure that different teams within Google that are working on different projects still to a large degree follow the same style, so if a project is handed from one team to another, at least style will not be a problem.
"},{"location":"s3_reproducibility/","title":"Reproducibility","text":"Slides
Today is all about reproducibility - one of those concepts that everyone agrees is very important and something should be done about, but the reality is that it is very hard to secure full reproducibility. The last sessions have already touched a bit on how tools like conda
and code organization can help make code more reproducible. Today we are going all the way to ensure that our scripts and our computing environment are fully reproducible.
Reproducibility is closely related to the scientific method:
Observe -> Question -> Hypotheses -> Experiment -> Conclude -> Result -> Observe -> ...
Not having reproducible experiments essentially breaks the cycle between doing experiments and making conclusions. If experiments are not reproducible, then we do not expect that others will arrive at the same conclusion as ourselves. As machine learning experiments are fundamentally the same as doing chemical experiments in a laboratory, we should be equally careful in making sure our environments are reproducible (think of your laptop as your laboratory).
Secondly, if we focus on why reproducibility matters especially in machine learning, it is part of the bigger challenge of making sure that machine learning is trustworthy.
Many different aspects are needed if trustworthy machine learning is ever going to be a reality. We need robustness of our pipelines so we can trust that they do not fail under heavy load. We need integrity to make sure that pipelines are deployed only if they are of high quality. We need explainability to make sure that we understand what our machine learning models are doing, so they are not just black boxes. We need reproducibility to make sure that the results of our models can be reproduced over and over again. Finally, we need fairness to make sure that our models are not biased toward specific populations. Figure inspired by this paper.
Trustworthy ML is the idea that machine learning agents can be trusted. Take the example of a machine learning agent being responsible for medical diagnoses. It is very clear that we need to be able to trust that the agent gives us the correct diagnosis for the system to work in practice. Reproducibility plays a big role here, because without it we cannot be sure that the same agent deployed at two different hospitals will give the same diagnosis (given the same input).
Learning objectives
The learning objectives of this session are:
docker
to create a reproducible container, including how to build them from scratch
hydra
to integrate with config files
With docker we can make sure that our compute environment is reproducible, but that does not mean that all our experiments magically become reproducible. There are other factors that are important for creating reproducible experiments.
In this paper (highly recommended read) the authors tried to reproduce the results of 255 papers and tried to figure out which factors were significant for success. One of those factors was \"Hyperparameters Specified\", i.e. whether or not the authors of the paper had precisely specified the hyperparameters that were used to run the experiments. It should come as no surprise that this can be a determining factor for reproducibility, however it is not a given that hyperparameters are always well specified.
"},{"location":"s3_reproducibility/config_files/#configure-experiments","title":"Configure experiments","text":"There is really no way around it: deep learning contains a lot of hyperparameters. In general, a hyperparameter is any parameter that affects the learning process (e.g. the weights of a neural network are not hyperparameters because they are a consequence of the learning process). The problem with having many hyperparameters to control in your code, is that if you are not careful and structure them it may be hard after running a experiment to figure out which hyperparameters were actually used. Lack of proper configuration management can cause serious problems with reliability, uptime, and the ability to scale a system.
One of the most basic ways of structuring hyperparameters, is just to put them directly into you train.py
script in some object:
class my_hp:\n    batch_size = 64\n    lr = 1e-3\n    other_hp = 12345\n\n# easy access to them\ndl = DataLoader(dataset, batch_size=my_hp.batch_size)\n
the problem here is that configuration is not easy. Each time you want to run a new experiment, you basically have to change the script. If you run the code multiple times without committing the changes in between, the exact hyperparameter configuration for some experiments may be lost. Alright, with this in mind you change strategy and use an argument parser, e.g. running experiments like this:
python train.py --batch_size 256 --learning_rate 1e-4 --other_hp 12345\n
This at least solves the problem with configurability. However, we can again end up losing experiments if we are not careful.
What we really want is some way to easily configure our experiments where the hyperparameters are systematically saved with the experiment. For this we turn our attention to Hydra, a configuration tool that is based around writing config files to keep track of hyperparameters. Hydra operates on top of OmegaConf which is a yaml
based hierarchical configuration system.
A simple yaml
configuration file could look like
#config.yaml\nhyperparameters:\n batch_size: 64\n learning_rate: 1e-4\n
with the corresponding python code for loading the file
from omegaconf import OmegaConf\n# loading\nconfig = OmegaConf.load('config.yaml')\n\n# accessing in two different ways\ndl = DataLoader(dataset, batch_size=config.hyperparameters.batch_size)\noptimizer = torch.optim.Adam(model.parameters(), lr=config['hyperparameters']['learning_rate'])\n
or using hydra
for loading the configuration
import hydra\n\n@hydra.main(config_name=\"basic.yaml\")\ndef main(cfg):\n print(cfg.hyperparameters.batch_size, cfg.hyperparameters.learning_rate)\n\nif __name__ == \"__main__\":\n main()\n
The idea behind refactoring our hyperparameters into .yaml
files is that we disentangle the model configuration from the model. In this way it is easier to do version control of the configuration because we have it in a separate file.
Exercise files
The main idea behind the exercises is to take a single script (that we provide) and use Hydra to make sure that everything gets correctly logged such that you would be able to exactly report to others how each experiment was configured. In the provided script, the hyperparameters are hardcoded into the code and your job will be to separate them out into a configuration file.
Note that we provide a solution (in the vae_solution
folder) that can help you get through the exercise, but try to look online for your answers before looking at the solution. Remember: its not about the result, its about the journey.
Start by installing hydra: pip install hydra-core --upgrade
Next take a look at the vae_mnist.py
and model.py
file and understand what is going on. It is a model we will revisit during the course.
Identify the key hyperparameters of the script. Some of them should be easy to find, but at least 3 have made it into the core part of the code. One essential hyperparameter is also not included in the script but is needed to be completely reproducible (HINT: the weights of any neural network are initialized at random).
Write a configuration file config.yaml
where you write down the hyperparameters that you have found
Get the script running by loading the configuration file inside your script (using hydra) so that the hyperparameters are incorporated into the script; a minimal sketch is shown below. Note: you should only edit the vae_mnist.py
file and not the model.py
file.
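A minimal sketch of what the hydra part of vae_mnist.py could end up looking like; the hyperparameter names (batch_size, lr, seed) are just placeholders for whatever you identified in the previous exercise:
import hydra\nimport torch\n\n@hydra.main(config_path=\"conf\", config_name=\"config.yaml\")\ndef main(cfg):\n    torch.manual_seed(cfg.hyperparameters.seed)  # covers the random weight initialization\n    batch_size = cfg.hyperparameters.batch_size\n    lr = cfg.hyperparameters.lr\n    ...  # the rest of the original training code, now reading values from cfg\n\nif __name__ == \"__main__\":\n    main()\n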
Run the script
By default hydra will write the results to a outputs
folder, with a sub-folder for the day the experiment was run and a further sub-folder for the time it was started. Inspect your run by going over each file that hydra has generated and check that the information has been logged. Can you find the hyperparameters?
Hydra also allows for dynamically changing and adding parameters on the fly from the command-line:
Try changing one parameter from the command-line
python vae_mnist.py hyperparameters.seed=1234\n
Try adding one parameter from the command-line
python vae_mnist.py +experiment.stuff_that_i_want_to_add=42\n
By default the file vae_mnist.log
should be empty, meaning that whatever you printed to the terminal did not get picked up by Hydra. This is due to Hydra under the hood making use of the native python logging package. This means that to also save all printed output from the script we need to convert all calls to print
with log.info
Create a logger in the script:
import logging\nlog = logging.getLogger(__name__)\n
Exchange all calls to print
with calls to log.info
Try re-running the script and make sure that the output printed to the terminal also gets saved to the vae_mnist.log
file
Make sure that your script is fully reproducible. To check this you will need two runs of the script to compare. Then run the reproducibility_tester.py
script as
python reproducibility_tester.py path/to/run/1 path/to/run/2\n
the script will go over the trained weights to see if they match and check that the hyperparameters were the same. Note: for the script to work, the weights should be saved to a file called trained_model.pt
(this is the default of the vae_mnist.py
script, so only relevant if you have changed the saving of the weights)
Finally, make a new experiment using a new configuration file where you have changed a hyperparameter of your own choice. You are not allowed to change the configuration file in the script but should instead be able to provide it as an argument when launching the script e.g. something like
python vae_mnist.py experiment=exp2\n
We recommend that you use a file structure like this
|--conf\n| |--config.yaml\n| |--experiments\n| |--exp1.yaml\n| |--exp2.yaml\n|--my_app.py\n
Make your MNIST code reproducible! Apply what you have just done to the simple script to your MNIST code. Only requirement is that you this time use multiple configuration files, meaning that you should have at least one model_conf.yaml
file and a training_conf.yaml
file that separates out the hyperparameters that have to do with the model definition and those that have to do with the training. You can also choose to work with even more complex config setups: in the image below the configuration has two layers such that we individually can specify hyperparameters belonging to a specific model architecture and hyperparameters for each individual optimizer that we may try.
Image credit"},{"location":"s3_reproducibility/docker/","title":"M9 - Docker","text":""},{"location":"s3_reproducibility/docker/#docker","title":"Docker","text":"
Core Module
Image credit
While the above picture may seem silly at first, it is actually pretty close to how docker came into existence. A big part of creating an MLOps pipeline is that you are able to reproduce it. Reproducibility goes beyond versioning our code with git
and using conda
environments to keep track of our python installations. To really get reproducibility we also need to capture system level components like
Docker provides this kind of system-level reproducibility by creating isolated program dependencies. In addition to reproducibility, one of docker's key features is scalability, which is important when we later on are going to discuss deployment. Because docker is system-level reproducible, it does not (conceptually) matter if we try to start our program on a single machine or on 1000 machines at once.
"},{"location":"s3_reproducibility/docker/#docker-overview","title":"Docker overview","text":"Docker has three main concepts: docker file, docker image and docker container:
A docker file is a basic text document that contains all the commands a user could call on the command line to run an application. This includes installing dependencies, pulling data from online storage, setting up code and what commands that you want to run (e.g. python train.py
)
Running, or more correctly building a docker file will create a docker image. An image is a lightweight, standalone/containerized, executable package of software that includes everything (application code, libraries, tools, dependencies etc.) necessary to make an application run.
Actually running an image will create a docker container. This means that the same image can be launched multiple times, creating multiple containers.
The exercises today will focus on how to construct the actual docker file, as this is the first step to constructing your own container.
"},{"location":"s3_reproducibility/docker/#docker-sharing","title":"Docker sharing","text":"The whole point of using docker is that sharing applications becomes much easier. In general, we have two options
After creating the Dockerfile
we can simply commit it to github (its just a text file) and then ask other users to simply build the image by themselves.
After building the image ourselves, we can choose to upload it to an image registry such as Docker Hub where others can get our image by simply running docker pull
, making them able to instantaneously run it as a container, as shown in the figure below
In the following exercises we guide you how to build a docker file for your MNIST repository that will make the training and prediction a self contained application. Please make sure that you somewhat understand each step and do not just copy of the exercise. Also note that you probably need to execute the exercise from an elevated terminal e.g. with administrative privilege.
The exercises today are only an introduction to docker and some of the steps are going to be unoptimized from a production setting view. For example we often want to keep the size of docker image as small as possible, which we are not focusing on for these exercises.
If you are using VScode
then we recommend install the docker VScode extension for easy getting an overview of which images have been build and which are running. Additionally the extension named Dev Containers may also be beneficial for you to download.
Start by installing docker. How much trouble that you need to go through depends on your operating system. For Windows and Mac we recommend they install Docker desktop, which comes with a graphical user interface (GUI) for quickly viewing docker images and docker containers currently build/in-use. Windows users that have not installed WSL yet are going to have to do it now (as docker need it as backend for starting virtual machines) but you do not need to install docker in WSL. After installing docker we recommend that you restart you laptop.
Try running the following to confirm that your installation is working:
docker run hello-world\n
which should give the message
Hello from Docker!\nThis message shows that your installation appears to be working correctly.\n
Next lets try to download a image from docker hub. Download the busybox
image:
docker pull busybox\n
which is an very small (1-5Mb) containerized application that contains the most essential GNU fileutils, shellutils etc.
After pulling the image, write
docker images\n
which should show you all images that are available. You should see the busybox
image that we just downloaded.
Lets try to run this image
docker run busybox\n
you will see that nothing happens! The reason is that we did not provide any commands to docker run
. We essentially just asked it to start the busybox
virtual machine, do nothing and then close it again. Now, try again, this time with
docker run busybox echo \"hello from busybox\"\n
Note how fast this process is. In just a few seconds, Docker is able to start a virtual machine, execute a command and kill it afterwards.
Try running
docker ps\n
what does this command do? What if you add -a
to the end?
If we wanted to run multiple commands within the virtual machine, we can start it in interactive mode
docker run -it busybox\n
this can be a great way to investigate what the filesystem of our virtual machine looks like.
As you may have already noticed by now, each time we execute docker run
we can still see small remnants of the containers using docker ps -a
. These stray containers can end up take a lot of disk space. To remove them, use docker rm
where you provide the container id that you want to delete
docker rm <container_id>\n
Lets now move on to trying to construct a docker file ourselves for our MNIST project. Create a file called trainer.dockerfile
. The intention is that we want to develop one dockerfile for running our training script and one for doing predictions.
Instead of starting from scratch we nearly always want to start from some base image. For this exercise we are going to start from a simple python
image. Add the following to your Dockerfile
# Base image\nFROM python:3.9-slim\n
Next we are going to install some essentials in our image. The essentials more or less consist of a python installation. These instructions may seem familiar if you are using linux:
# install python\nRUN apt update && \\\n apt install --no-install-recommends -y build-essential gcc && \\\n apt clean && rm -rf /var/lib/apt/lists/*\n
The previous two steps are common for any docker application where you want to run python. All the remaining steps are application specific (to some degree):
Lets copy over our application (the essential parts) from our computer to the container:
COPY requirements.txt requirements.txt\nCOPY pyproject.toml pyproject.toml\nCOPY <project-name>/ <project-name>/\nCOPY data/ data/\n
Remember that we only want the essential parts to keep our docker image as small as possible. Why do we need each of these files/folders to run training in our docker container?
Lets set the working directory in our container and add commands that install the dependencies (1):
We split the installation into two steps, so that docker can cache our project dependencies separately from our application code. This means that if we change our application code, we do not need to reinstall all the dependencies. This is a common strategy for docker images.
As an alternative you can use RUN make requirements
if you have a Makefile
that installs the dependencies. Just remember to also copy over the Makefile
into the docker image.
WORKDIR /\nRUN pip install -r requirements.txt --no-cache-dir\nRUN pip install . --no-deps --no-cache-dir\n
the --no-cache-dir
is quite important. Can you explain what it does and why it is important in relation to docker.
Finally, we are going to name our training script as the entrypoint for our docker image. The entrypoint is the application that we want to run when the image is being executed:
ENTRYPOINT [\"python\", \"-u\", \"<project_name>/train_model.py\"]\n
the \"u\"
here makes sure that any output from our script e.g. any print(...)
statements gets redirected to our terminal. If not included you would need to use docker logs
to inspect your run.
We are now ready to build our docker file into a docker image
docker build -f trainer.dockerfile . -t trainer:latest\n
MAC M1/M2 users In general docker images are built for a specific platform. For example, if you are using a Mac with an M1/M2 chip then you are running on an ARM architecture. If you are using a Windows or Linux machine then you are most likely running on an AMD64 architecture. This is important to know when building docker images: images you build may not work on platforms other than the one you built them on. You can specify which platform you want to build for by adding the --platform
argument to the docker build
command:
docker build --platform linux/amd64 -f train.dockerfile . -t trainer:latest\n
and also when running the image:
docker run --platform linux/amd64 trainer:latest\n
Do note that this will significantly increase the build and run time of your docker image when running locally, because docker will need to emulate the other platform. In general, for the exercises today you should not need to specify the platform, but be aware of this if you are building docker images on your own.
please note that here we are providing two extra arguments to docker build
. The -f trainer.dockerfile .
(the dot is important to remember) indicates which dockerfile we want to build from (not needed if you named it just Dockerfile
) and the -t trainer:latest
is the respective name and tag that we see afterwards when running docker images
(see image below). Please note that building a docker image can take a couple of minutes.
Docker images and space
Docker images can take up a lot of space on your computer. Especially, the docker images we are trying to build because Pytorch is huge dependency. If you are running low on space, you can try to
docker system prune\n
alternatively you can manually delete images using docker rmi {image_name}:{image_tag}
.
Try running docker images
and confirm that you get output similar to the one above. If you succeeds with this, then try running the docker image
docker run --name experiment1 trainer:latest\n
you should hopefully see your training starting. Please note that we can start as many containers that we want at the same time by giving them all different names using the --name
tag.
You are most likely going to re-build your docker image multiple times, either due to an implementation error or the addition of new functionality. Therefore, instead of watching pip suffer through downloading torch
for the 20th time, you can reuse the cache from the last time the docker image was built. To do this, replace the line in your dockerfile that installs your requirements with:
RUN --mount=type=cache,target=~/pip/.cache pip install -r requirements.txt --no-cache-dir\n
which mounts your local pip cache to the docker image. For building the image you need to have enabled the BuildKit feature. If you have docker version v23.0 or later (you can check this by running docker version
) then this is enabled by default. Else you need to enable it by setting the environment variable DOCKER_BUILDKIT=1
before building the image.
Try changing your dockerfile and re-building the image. You should see that the build process is much faster.
Remember, if you ever are in doubt how files are organized inside a docker image you always have the option to start the image in interactive mode:
docker run -it --entrypoint sh {image_name}:{image_name}\n
When your training has completed you will notice that any files created when running your training script are not present on your laptop (for example if your script is saving the trained model to a file). This is because the files were created inside your container (which is its own little machine). To get the files out you have two options:
If you already have a completed run then you can use
docker cp\n
to copy the files between your container and laptop. For example to copy a file called trained_model.pt
from a folder you would do:
docker cp {container_name}:{dir_path}/{file_name} {local_dir_path}/{local_file_name}\n
Try this out.
A much more efficient strategy is to mount a volume that is shared between the host (your laptop) and the container. This can be done with the -v
option for the docker run
command. For example, if we want to automatically get the trained_model.pt
file after running our training script we could simply execute the container as
docker run --name {container_name} -v %cd%/models:/models/ trainer:latest\n
this command mounts our local models
folder as a corresponding models
folder in the container. Any file save by the container to this folder will be synchronized back to our host machine. Try this out! Note if you have multiple files/folders that you want to mount (if in doubt about file organization in the container try to do the next exercise first). Also note that the %cd%
need to change depending on your OS, see this page for help.
With training done we also need to write an application for prediction. Create a new docker image called predict.dockerfile
. This file should call your <project_name>/models/predict_model.py
script instead. This image will need some trained model weights to work. Feel free to either includes these during the build process or mount them afterwards. When you created the file try to build
and run
it to confirm that it works. Hint: if you are passing in the model checkpoint and prediction data as arguments to your script, your docker run
probably need to look something like
docker run --name predict --rm \\\n -v %cd%/trained_model.pt:/models/trained_model.pt \\ # mount trained model file\n -v %cd%/data/example_images.npy:/example_images.npy \\ # mount data we want to predict on\n predict:latest \\\n ../../models/trained_model.pt \\ # argument to script, path relative to script location in container\n ../../example_images.npy\n
(Optional, requires GPU support) By default a virtual machine created by docker only has access to your cpu
and not your gpu
. While you do not necessarily have a laptop with a GPU that supports training of neural network (e.g. one from Nvidia) it is beneficial that you understand how to construct a docker image that can take advantage of a GPU if you were to run this on a machine in the future that have a GPU (e.g. in the cloud). It does take a bit more work, but many of the steps will be similar to building a normal docker image.
There are three prerequisites for working with Nvidia GPU accelerated docker containers. First you need to have the Docker Engine installed (already taken care of), second an Nvidia GPU with updated GPU drivers and finally the Nvidia container toolkit installed. The last part you most likely have not installed yet and need to do now. Some distros of Linux have known problems with the installation process, so you may have to search through the known issues in the nvidia-docker repository to find a solution
To test that everything is working start by pulling a relevant Nvidia docker image. In my case this is the correct image:
docker pull nvidia/cuda:11.0.3-base-ubuntu20.04\n
but it may differ based on what cuda version you have. You can find all the different official Nvidia images here. After pulling the image, try running the nvidia-smi
command inside a container based on the image you just pulled. It should look something like this:
docker run --rm --gpus all nvidia/cuda:11.0.3-base-ubuntu20.04 nvidia-smi\n
and should show an image like below:
If it does not work, try redoing the steps.
We should hopefully have a working setup now for running Nvidia accelerated docker containers. Next step is to get Pytorch inside of our container, such that our Pytorch implementation also correctly identify the GPU. Luckily for us Nvidia provides a set of docker images for GPU-optimized software for AI, HPC and visualizations through their NGC Catalog. The containers that have to do with Pytorch can be seen here. Try pulling the latest:
docker pull nvcr.io/nvidia/pytorch:22.07-py3\n
It may take some time, because the NGC images include a lot of other software for optimizing Pytorch applications. It may be possible for you to find other images for running GPU accelerated applications that have a smaller memory footprint, but the NGC images are the recommended and supported way.
Lets test that this container works:
docker run --gpus all -it --rm nvcr.io/nvidia/pytorch:22.07-py3\n
this should run the container in interactive mode attached to your current terminal. Try opening python
in the container and try writing:
import torch\nprint(torch.cuda.is_available())\n
which hopefully should return True
.
Finally, we need to incorporate all this into our already developed docker files for our application. This is also fairly easy as we just need to change our FROM
statement in the beginning of our docker file:
FROM python:3.7-slim\n
change to
FROM nvcr.io/nvidia/pytorch:22.07-py3\n
try doing this to one of your docker files, build the image and run the container. Remember to check that your application is using GPU by printing torch.cuda.is_available()
.
(Optional) Another way you can use dockerfiles in your day to day work is for Dev-containers. Developer containers allows you to develop code directly inside a container, making sure that your code is running in the same environment as it will when deployed. This is especially useful if you are working on a project that has a lot of dependencies that are hard to install on your local machine. Setup instructions for VS code and Pycharm can be found here (should be simple since we have already installed docker):
We focus on the VS code setup here.
First install the Remote - Containers extension.
Create a .devcontainer
folder in your project root and create a Dockerfile
inside it. We keep this file very barebone for now, so lets just define a base installation of python:
FROM python:3.11-slim-buster\n\nRUN apt update && \\\n apt install --no-install-recommends -y build-essential gcc && \\\n apt clean && rm -rf /var/lib/apt/lists/*\n
Create a devcontainer.json
file in the .devcontainer
folder. This file should look something like this:
{\n \"name\": \"my_working_env\",\n \"dockerFile\": \"Dockerfile\",\n \"postCreateCommand\": \"pip install -r requirements.txt\"\n}\n
this file tells VS code that we want to use the Dockerfile
that we just created and that we want to install our python dependencies after the container has been created.
After creating these files, you should be able to open the command palette in VS code (F1) and search for the option Remote-Containers: Reopen in Container
or Remote-Containers: Rebuild and Reopen in Container
. Choose either of these options.
This will start a new VS code instance inside a docker container. You should be able to see this in the bottom left corner of your VS code window. You should also be able to see that the python interpreter has changed to the one inside the container.
You are now ready to start developing inside the container. Try opening a terminal and run python
and import torch
to confirm that everything is working.
(Optional) In M8 on Data version control you learned about the framework dvc
for version controlling data. A neutral question at this point would then be how to incorporate dvc
into our docker image. We need to do two things:
dvc
needs to have all the correct files to pull data from our remote storage
dvc
needs to have the correct credentials to pull data from our remote storage
We are going to assume that dvc
(and any dvc
extension needed) is part of your requirement.txt
file and that it is already being installed in a RUN pip install -r requirements.txt
command in your dockerfile. If not, then you need to add it.
Add the following lines to your dockerfile
RUN dvc init --no-scm\nCOPY .dvc/config .dvc/config\nCOPY *.dvc *.dvc\nRUN dvc config core.no_scm true\nRUN dvc pull\n
The first line initializes dvc
in the docker image. The --no-scm
option is needed because normally dvc
can only be initialized inside a git repository, but this option allows us to initialize dvc
without being in one. The second and third lines copy over the dvc
config file and the dvc
metadata files that are needed to pull data from your remote storage. The last line pulls the data.
If your data is not public, we need to provide credentials in some way to pull the data. We are for now going to do it in a not-so-secure way. When dvc
first connected to your drive a credential file was created. This file is located in $CACHE_HOME/pydrive2fs/{gdrive_client_id}/default.json
where $CACHE_HOME
depends on your operating system:
macOS: ~/Library/Caches
Linux: ~/.cache
(this is the typical location, but it may vary depending on what distro you are running)
Windows: {user}/AppData/Local
Find the file. The content should look similar to this (only some fields are shown):
{\n \"access_token\": ...,\n \"client_id\": ...,\n \"client_secret\": ...,\n \"refresh_token\": ...,\n ...\n}\n
We are going to copy the file into our docker image. This of course is not a secure way of doing it, but it is the easiest way to get started. As long as you are not sharing your docker image with anyone else, then it is fine. Add the following lines to your dockerfile before the RUN dvc pull
command:
COPY <path_to_default.json> default.json\nRUN dvc remote modify myremote --local gdrive_service_account_json_file_path default.json\n
where <path_to_default.json>
is the path to the default.json
file that you just found. The last line tells dvc
to use the default.json
file as the credentials for pulling data from your remote storage. You can confirm that this works by running dvc pull
in your docker image.
What is the difference between a docker image and a docker container?
SolutionA docker image is a template for a docker container. A docker container is a running instance of a docker image. A docker image is a static file, while a docker container is a running process.
What are the 3 steps involved in containerizing an application?
Solution
Containerizing an application involves (1) writing a docker file that describes the application, (2) building the docker file into a docker image and (3) running the image, which creates a container.
What advantage is there to running your application inside a docker container instead of running the application directly on your machine?
SolutionRunning inside a docker container gives you a consistent and independent environment for your application. This means that you can be sure that your application will run the same way on your machine as it will on another machine. Thus, docker gives the ability to abstract away the differences between different machines.
A docker container is built from a series of layers that are stacked on top of each other. This should be clear if you look at the output when building a docker image. What is the advantage of this?
Solution
The advantage is efficiency and reusability. When a change is made to a docker image, only the layer(s) that are changed need to be updated. For example, if you update the application code in your docker image, which usually is the last layer, then only that layer needs to be rebuilt, making the process much faster. Additionally, if you have multiple docker images that share the same base image, then the base image only needs to be downloaded once.
This covers the absolute minimum you should know about docker to get a working image and container. If you want to really deep dive into this topic you can find a copy of the Docker Cookbook by Sébastien Goasguen in the literature folder.
If you are actively going to be using docker in the near future, one thing to consider is the image size. Even these simple images that we have build still takes up GB in size. A number of optimizations steps can be taken to reduce the image size for you or your end user. If you have time you can read this article on different approaches to reduce image size. Additionally, you can take a look at the dive-in extension for docker desktop that lets you explore in depth your docker images.
"},{"location":"s4_debugging_and_logging/","title":"Debugging, Profiling, Logging and Boilerplate","text":"Slides
Today we are initially going to go over three different topics that are all fundamentally necessary for any data scientist or DevOps engineer:
All three topics can be characterized by something you probably already are familiar with. Since you started programming, you have done debugging, as nobody can write perfect code on the first try. Similarly, even if you have not formally profiled your code, I bet that you at some point have had some very slow code and optimized it to run faster; identifying and improving such bottlenecks is the fundamental idea of profiling code. Finally, logging is a very broad term and basically refers to any kind of output from your applications that helps you at a later point identify the \"performance\" of your application.
However, while we expect you to already be familiar with these topics, we do not expect all of you to be experts in them, as they are rarely the topics that get much focus. Today we are going to introduce some best practices and tools to help you with each of these three important topics.
As the final topic for today we are going to learn about how we can minimize boilerplate and focus on coding what actually matters for our project instead of all the boilerplate to get it working.
Learning objectives
The learning objectives of this session are:
pytorch-lightning
framework to minimize boilerplate code and structure deep learning models
Boilerplate is a general term that describes any standardized text, copy, documents, methods, or procedures that may be used over again without making major changes to the original. But how does this relate to doing machine learning projects? If you have already tried doing a couple of projects within machine learning you will probably have seen a pattern: every project usually consists of these three aspects of code:
While the latter two certainly seem important, in most cases the actual development or research revolves around defining the model. In this sense, both the training code and the utilities become boilerplate that should just carry over from one project to another. But the problem usually is that we have not generalized our training code to take care of the small adjustments that may be required in future projects, and we therefore end up implementing it over and over again every time we start a new project. This is of course a waste of our time that we should try to find a solution to.
This is where high-level frameworks come into play. High-level frameworks are built on top of another framework (Pytorch in this case) and try to abstract/standardize how to do particular tasks such as training. At first it may seem irritating that you need to comply with someone else's code structure, however there is a good reason for that. The idea is that you can focus on what really matters (your task, model architecture etc.) and do not have to worry about the actual boilerplate that comes with it.
The most popular high-level (training) frameworks within the Pytorch
ecosystem are:
They all offer many of the same features, so choosing one over the other for most projects should not matter that much. We are here going to use Pytorch Lightning
, as it offers all the functionality that we are going to need later in the course.
In general we refer to the documentation from Pytorch lightning if in doubt about how to format your code for doing specific tasks. We are here going to explain the key concepts of the API that you need to understand to use the framework, starting with the LightningModule
and the Trainer
.
The LightningModule
is a subclass of a standard nn.Module
that basically adds additional structure. In addition to the standard __init__
and forward
methods that need to be implemented in a nn.Module
, a LightningModule
further requires two more methods implemented:
training_step
: should contain your actual training code e.g. given a batch of data this should return the loss that you want to optimize
configure_optimizers
: should return the optimizer that you want to use
Below is a sketch of how these two methods could be added to a standard MNIST classifier.
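The course material shows its own version of this; the sketch below only illustrates the structure, and the tiny architecture is a placeholder:
import torch\nfrom pytorch_lightning import LightningModule\nfrom torch import nn\n\nclass MyAwesomeModel(LightningModule):\n    \"\"\"Sketch of an MNIST classifier written as a LightningModule.\"\"\"\n\n    def __init__(self):\n        super().__init__()\n        self.backbone = nn.Sequential(\n            nn.Flatten(), nn.Linear(28 * 28, 128), nn.ReLU(), nn.Linear(128, 10)\n        )\n        self.criterion = nn.CrossEntropyLoss()\n\n    def forward(self, x):\n        return self.backbone(x)\n\n    def training_step(self, batch, batch_idx):\n        data, target = batch\n        preds = self(data)\n        return self.criterion(preds, target)  # the loss we want to optimize\n\n    def configure_optimizers(self):\n        return torch.optim.Adam(self.parameters(), lr=1e-3)\n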
Compared to a standard nn.Module
, the additional methods in the LightningModule
basically specifies exactly how you want to optimize your model.
The second component to lightning is the Trainer
object. As the name suggests, the Trainer object takes care of the actual training, automating everything that you do not want to worry about.
from pytorch_lightning import Trainer\n\nmodel = MyAwesomeModel()  # this is our LightningModule\ntrainer = Trainer()\ntrainer.fit(model)\n
That is essentially all that you need to specify in lightning to have a working model. The trainer object does not have methods that you need to implement yourself, but it has a bunch of arguments that can be used to control how many epochs you want to train for, whether you want to run on gpu etc. To get the training of our model to work we just need to specify how our data should be fed into the lightning framework.
"},{"location":"s4_debugging_and_logging/boilerplate/#data","title":"Data","text":"For organizing our code that has to do with data in Lightning
we essentially have three different options. However, all three assume that we are using torch.utils.data.DataLoader
for the dataloading.
If we already have a train_dataloader
and possible also a val_dataloader
and test_dataloader
defined we can simply add them to our LightningModule
using the similar named methods:
def train_dataloader(self):\n return DataLoader(...)\n\ndef val_dataloader(self):\n return DataLoader(...)\n\ndef test_dataloader(self):\n return DataLoader(...)\n
Maybe even simpler, we can directly feed such dataloaders in the fit
method of the Trainer
object:
trainer.fit(model, train_dataloader, val_dataloader)\ntrainer.test(model, test_dataloader)\n
Finally, Lightning
also has the LightningDataModule
that organizes data loading into a single structure, see this page for more info. Putting data loading into a DataModule
makes sense as it can then be reused between projects; a sketch of what this could look like is shown below.
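A sketch of such a DataModule, here using the torchvision MNIST dataset purely for illustration:
from pytorch_lightning import LightningDataModule\nfrom torch.utils.data import DataLoader, random_split\nfrom torchvision import datasets, transforms\n\nclass MNISTDataModule(LightningDataModule):\n    def __init__(self, data_dir: str = \"data\", batch_size: int = 64):\n        super().__init__()\n        self.data_dir = data_dir\n        self.batch_size = batch_size\n\n    def prepare_data(self):\n        # download once\n        datasets.MNIST(self.data_dir, train=True, download=True)\n        datasets.MNIST(self.data_dir, train=False, download=True)\n\n    def setup(self, stage=None):\n        transform = transforms.ToTensor()\n        full = datasets.MNIST(self.data_dir, train=True, transform=transform)\n        self.train_set, self.val_set = random_split(full, [55000, 5000])\n        self.test_set = datasets.MNIST(self.data_dir, train=False, transform=transform)\n\n    def train_dataloader(self):\n        return DataLoader(self.train_set, batch_size=self.batch_size, shuffle=True)\n\n    def val_dataloader(self):\n        return DataLoader(self.val_set, batch_size=self.batch_size)\n\n    def test_dataloader(self):\n        return DataLoader(self.test_set, batch_size=self.batch_size)\n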
Callbacks are one way to add additional functionality to your model that, strictly speaking, is not already part of your model. Callbacks should therefore be seen as self-contained features that can be reused between projects. You have the option to implement callbacks yourself (by inheriting from the pytorch_lightning.callbacks.Callback
base class) or use one of the built-in callbacks. Of particular interest are the ModelCheckpoint
and EarlyStopping
callbacks:
The ModelCheckpoint
makes sure to save checkpoints of your model. This is in principle not hard to do yourself, but the ModelCheckpoint
callback offers additional functionality by saving checkpoints only when some metric improves, or only save the best K
performing models etc.
from pytorch_lightning import Trainer\nfrom pytorch_lightning.callbacks import ModelCheckpoint\n\nmodel = MyModel()\ncheckpoint_callback = ModelCheckpoint(\n    dirpath=\"./models\", monitor=\"val_loss\", mode=\"min\"\n)\ntrainer = Trainer(callbacks=[checkpoint_callback])\ntrainer.fit(model)\n
The EarlyStopping
callback can help you prevent overfitting by automatically stopping the training if a certain value is not improving anymore:
from pytorch_lightning import Trainer\nfrom pytorch_lightning.callbacks import EarlyStopping\n\nmodel = MyModel()\nearly_stopping_callback = EarlyStopping(\n    monitor=\"val_loss\", patience=3, verbose=True, mode=\"min\"\n)\ntrainer = Trainer(callbacks=[early_stopping_callback])\ntrainer.fit(model)\n
Multiple callbacks can be used by passing them all in a list e.g.
trainer = Trainer(callbacks=[checkpoint_callback, early_stopping_callback])\n
"},{"location":"s4_debugging_and_logging/boilerplate/#exercises","title":"\u2754 Exercises","text":"Please note that the in following exercise we will basically ask you to reformat all your MNIST code to follow the lightning standard, such that we can take advantage of all the tricks the framework has to offer. The reason we did not implement our model in lightning
to begin with is that, to truly understand why it is beneficial to use a high-level framework to do some of the heavy lifting, you need to have gone through some of the implementation troubles yourself.
Install pytorch lightning:
pip install pytorch-lightning # (1)!\n
pip install lightning
which includes more than just the Pytorch Lightning
package. This also includes Lightning Fabric
and Lightning Apps
which you can read more about here and here.
Convert your corrupted MNIST model into a LightningModule
. You can either choose to completely override your old model or implement it in a new file. The bare minimum that you need to add while converting to get it working with the rest of lightning is:
The training_step
method. This function should contain essentially what goes into a single training step and should return the loss at the end
The configure_optimizers
method
Please read the documentation for more info.
Make sure your data is formatted such that it can be loaded using the torch.utils.data.DataLoader
object.
Instantiate a Trainer
object. It is recommended to take a look at the trainer arguments (there are many of them) and maybe adjust some of them:
Investigate what the default_root_dir
flag does
As default lightning will run for 1000 epochs. This may be too much (for now). Change this by changing the appropriate flag. Additionally, there also exist a flag to set the maximum number of steps that we should train for.
To start with we also want to limit the amount of training data to 20% of its original size. which trainer flag do you need to set for this to work?
Try fitting your model: trainer.fit(model)
Now try adding some callbacks
to your trainer, for example as sketched below.
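One possible way of wiring the flags and callbacks from these exercises together is sketched here; the flag values are only examples and MyAwesomeModel stands in for your own LightningModule:
from pytorch_lightning import Trainer\nfrom pytorch_lightning.callbacks import EarlyStopping, ModelCheckpoint\n\nmodel = MyAwesomeModel()  # your LightningModule from the previous exercise\ntrainer = Trainer(\n    default_root_dir=\"lightning_logs\",  # where logs and checkpoints end up\n    max_epochs=10,                       # instead of the default of 1000 epochs\n    limit_train_batches=0.2,             # use only 20% of the training data\n    callbacks=[\n        # monitoring assumes that 'train_loss' is logged via self.log (next exercise)\n        ModelCheckpoint(dirpath=\"models\", monitor=\"train_loss\", mode=\"min\"),\n        EarlyStopping(monitor=\"train_loss\", patience=3),\n    ],\n)\ntrainer.fit(model)\n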
The previous module was all about logging in wandb
, so the question is naturally how lightning
supports this. Lightning does not only support wandb
, but also many others. Common for all of them, is that logging just need to happen through the self.log
method in your LightningModule
:
Add self.log
to your LightningModule. It should look something like this:
def training_step(self, batch, batch_idx):\n data, target = batch\n preds = self(data)\n loss = self.criterion(preds, target)\n acc = (target == preds.argmax(dim=-1)).float().mean()\n self.log('train_loss', loss)\n self.log('train_acc', acc)\n return loss\n
Add the wandb
logger to your trainer
trainer = Trainer(logger=pl.loggers.WandbLogger(project=\"dtu_mlops\"))\n
and try to train the model. Confirm that you are seeing the scalars appearing in your wandb
portal.
self.log
does sadly only support logging scalar tensors. Luckily, for logging other quantities we can still access the standard wandb.log
through our model
import wandb\n\ndef training_step(self, batch, batch_idx):\n    ...\n    # self.logger.experiment is the same as wandb.log\n    self.logger.experiment.log({'logits': wandb.Histogram(preds)})\n
try doing this, by logging something else than scalar tensors.
Finally, we maybe also want to do some validation or testing. In lightning we just need to add the validation_step
and test_step
methods to our lightning module and supply the respective data in the form of a separate dataloader. Try to at least implement one of them; a sketch of a validation_step is shown below.
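A sketch of a validation_step, assuming the same criterion attribute that the training_step above uses:
def validation_step(self, batch, batch_idx):\n    data, target = batch\n    preds = self(data)\n    loss = self.criterion(preds, target)\n    acc = (target == preds.argmax(dim=-1)).float().mean()\n    self.log(\"val_loss\", loss)\n    self.log(\"val_acc\", acc)\n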
(Optional, requires GPU) One of the big advantages of using lightning
is that you no longer need to deal with device placement, e.g. calling .to('cuda')
everywhere. If you have a GPU, try to set the gpus
flag in the trainer. If you do not have one, do not worry, we are going to return to this when we are going to run training in the cloud.
(Optional) As default Pytorch uses float32
for representing floating point numbers. However, research has shown that neural network training is very robust towards a decrease in precision. The great benefit of going from float32
to float16
is that we get approximately half the memory consumption. Try out half-precision training in Pytorch lightning. You can enable this by setting the precision flag in the Trainer
.
(Optional) Lightning also have built-in support for profiling. Checkout how to do this using the profiler argument in the Trainer
object.
(Optional) Another great feature of Lightning is that it allows for easily defining command line interfaces through the Lightning CLI feature. The Lightning CLI is essentially a drop-in replacement for defining command line interfaces (covered in this module) and can also replace the need for config files (covered in this module) for securing reproducibility when working inside the Lightning framework. We highly recommend checking out the feature and that you try to refactor your code such that you do not need to call trainer.fit
anymore but it is instead directly controlled from the Lightning CLI.
Free exercise: Experiment with what the lightning framework is capable of. Either try out more of the trainer flags, some of the other callbacks, or maybe look into some of the other methods that can be implemented in your lightning module. Only your imagination is the limit!
That covers everything for today. It has been a mix of topics that all should help you write \"better\" code (by some objective measure). If you want to deep dive more into the Pytorch lightning framework, we highly recommend looking at the different tutorials in the documentation that covers more advanced models and training cases. Additionally, we also want to highlight other frameworks in the lightning ecosystem:
Debugging is very hard to teach and is one of the skills that just comes with experience. That said, there are good and bad ways to debug a program. We are all probably familiar with just inserting print(...)
statements everywhere in our code. It is easy and can many times help narrow down where the problem happens. That said, this is not a great way of debugging when dealing with a very large codebase. You should therefore familiarize yourself with the built-in python debugger as it may come in handy during the course.
To invoke the build in python debugger you can either:
Set a trace directly with the python debugger by calling
import pdb\npdb.set_trace()\n
anywhere you want to stop the code. Then you can use different commands (see the python_debugger_cheatsheet.pdf
) to step through the code.
If you are using an editor, then you can insert inline breakpoints (in VS code this can be done by pressing F9
) and then execute the script in debug mode (inline breakpoints can often be seen as small red dots to the left of your code). The editor should then offer some interface to allow you step through your code. Here is a guide to using the build in debugger in VScode.
Additionally, if your program is stopping on an error and you automatically want to start the debugger where it happens, then you can simply launch the program like this from the terminal
python -m pdb -c continue my_script.py\n
Exercise files
We here provide a script vae_mnist_bugs.py
which contains a number of bugs to get it running. Start by going over the script and try to understand what is going on. Hereafter, try to get it running by solving the bugs. The following bugs exist in the script:
Some of the bugs prevents the script from even running, while some of them influences the training dynamics. Try to find them all. We also provide a working version called vae_mnist_working.py
(but please try to find the bugs before looking at the script). Successfully debugging and running the script should produce three files:
orig_data.png
containing images from the standard MNIST training setreconstructions.png
reconstructions from the modelgenerated_samples.png
samples from the modelAgain, we cannot stress enough that the exercise is actually not about finding the bugs but using a proper debugger to find them.
"},{"location":"s4_debugging_and_logging/logging/","title":"M13 - Logging","text":""},{"location":"s4_debugging_and_logging/logging/#logging","title":"Logging","text":"Core Module
Logging in general refers to the practise of recording events activities over time. Having proper logging in your applications can be extremely beneficial for a few reasons:
Debugging becomes easier because we in a more structure way can output information about the state of our program, variables, values etc. to help identify and fix bugs or unexpected behavior.
When we move into a more production environment, proper logging is essential for monitoring the health and performance of our application.
It can help in auditing as logging info about specific activities etc. can help keeping a record of who did what and when.
Having proper logging means that info is saved for later, that can be analysed to gain insight into the behavior of our application, such as trends.
We are in this course going to divide the kind of logging we can do into categories: application logging and experiment logging. In general application logging is important regardless of the kind of application you are developing, whereas experiment logging is important machine learning based projects where we are doing experiments.
"},{"location":"s4_debugging_and_logging/logging/#application-logging","title":"Application logging","text":"The most basic form of logging in Python applications is the good old print
statement:
for batch_idx, batch in enumerate(dataloader):\n print(f\"Processing batch {batch_idx} out of {len(dataloader)}\")\n ...\n
This will keep a \"record\" of the events happening in our script, in this case how far we have progressed. We could even change the print to include something like batch.shape
to also have information about the current data being processed.
Using print
statements is fine for small applications, but to have proper logging we need a bit more functionality than what print
can offer. Python actually comes with a great logging module, that defines functions for flexible logging. It is exactly this we are going to look at in this module.
The four main components to the Python logging module are:
Logger: The main entry point for using the logging system. You create instances of the Logger class to emit log messages.
Handler: Defines where the log messages go. Handlers send the log messages to specific destinations, such as the console or a file.
Formatter: Specifies the layout of the log messages. Formatters determine the structure of the log records, including details like timestamps and log message content.
Level: Specifies the severity of a log message.
Especially, the last point is important to understand. Levels essentially allows of to get rid of statements like this:
if debug:\n print(x.shape)\n
where the logging is conditional on the variable debug
which we can set a runtime. Thus, it is something we can disable for users of our application (debug=False
) but have enabled when we develop the application (debug=True
). And it makes sense that not all things logged, should be available to all stakeholders of a codebase. We as developers probably always wants the highest level of logging, whereas users of the our code need less info and we may want to differentiate this based on users.
It is also important to understand the different between logging and error handling. Error handling Python is done using raise
statements and try/catch
like:
def f(x: int):\n if not isinstance(x, int):\n raise ValueError(\"Expected an integer\")\n return 2 * x\n\ntry:\n f(5):\nexcept ValueError:\n print(\"I failed to do a thing, but continuing.\")\n
Why would we evere need log warning
, error
, critical
levels of information, if we are just going to handle it? The reason is that raising exceptions are meant to change the program flow at runtime e.g. things we do not want the user to do, but we can deal with in some way. Logging is always for after a program have run, to inspect what went wrong. Sometimes you need one, sometimes the other, sometimes both.
Exercises are inspired by this made with ml module on the same topic. If you need help for the exercises you can find a simple solution script here.
As logging is a built-in module in Python, nothing needs to be installed. Instead start a new file called my_logger.py
and start out with the following code:
import logging\nimport sys\n\n# Create super basic logger\nlogging.basicConfig(stream=sys.stdout, level=logging.DEBUG)\nlogger = logging.getLogger(__name__) # (1)\n\n# Logging levels (from lowest to highest priority)\nlogger.debug(\"Used for debugging your code.\")\nlogger.info(\"Informative messages from your code.\")\nlogger.warning(\"Everything works but there is something to be aware of.\")\nlogger.error(\"There's been a mistake with the process.\")\nlogger.critical(\"There is something terribly wrong and process may terminate.\")\n
__name__
always contains the name of the script or module that is currently being run. Therefore if we initialize our logger base using this variable, it will always be unique to our application and not conflict with loggers set up by any third-party package.Try running the code. Then try changing the argument level
when creating the logger. What happens when you do that?
Instead of sending logs to the terminal, we may also want to send them to a file. This can be beneficial, such that only warning
level logs and higher are available to the user, but debug
and info
is still saved when the application is running.
Try adding the following dict to your logger.py
file:
logging_config = {\n \"version\": 1,\n \"formatters\": { # (1)\n \"minimal\": {\"format\": \"%(message)s\"},\n \"detailed\": {\n \"format\": \"%(levelname)s %(asctime)s [%(name)s:%(filename)s:%(funcName)s:%(lineno)d]\\n%(message)s\\n\"\n },\n },\n \"handlers\": { # (2)\n \"console\": {\n \"class\": \"logging.StreamHandler\",\n \"stream\": sys.stdout,\n \"formatter\": \"minimal\",\n \"level\": logging.DEBUG,\n },\n \"info\": {\n \"class\": \"logging.handlers.RotatingFileHandler\",\n \"filename\": Path(LOGS_DIR, \"info.log\"),\n \"maxBytes\": 10485760, # 10 MB\n \"backupCount\": 10,\n \"formatter\": \"detailed\",\n \"level\": logging.INFO,\n },\n \"error\": {\n \"class\": \"logging.handlers.RotatingFileHandler\",\n \"filename\": Path(LOGS_DIR, \"error.log\"),\n \"maxBytes\": 10485760, # 10 MB\n \"backupCount\": 10,\n \"formatter\": \"detailed\",\n \"level\": logging.ERROR,\n },\n },\n \"root\": {\n \"handlers\": [\"console\", \"info\", \"error\"],\n \"level\": logging.INFO,\n \"propagate\": True,\n },\n}\n
The formatter section determines how logs should be formatted. Here we define two separate formatters, called minimal
and detailed
which we can use in the next part of the code.
The handlers section is in charge of what should happen to the different levels of logging. console
uses the minimal
format we defined and sends logs to the stdout
stream for messages of level DEBUG
and higher. The info
handler uses the detailed
format and sends messages of level INFO
and higher to a separate info.log
file. The error
handler does the same for messages of level ERROR
and higher to a file called error.log
.
you will need to set the LOGS_DIR
variable and also figure out how to add this logging_config
using the logging config submodule to your logger.
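A minimal sketch of how this could look (assuming you are fine with a local logs folder; note that LOGS_DIR must be defined before the logging_config dict is created):
import logging\nimport logging.config\nfrom pathlib import Path\n\nLOGS_DIR = Path(\"logs\")\nLOGS_DIR.mkdir(parents=True, exist_ok=True) # the rotating file handlers expect the folder to exist\n\nlogging.config.dictConfig(logging_config) # apply the dict defined above\nlogger = logging.getLogger(__name__)\n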
When the code successfully runs, check the LOGS_DIR
folder and make sure that a info.log
and error.log
file was created with the appropriate content.
Finally, let's try to add a little bit of style and color to our logging. For this we can use the package rich, which is great for rich text and beautiful formatting in terminals. Install rich
and add the following line to your my_logger.py
script:
from rich.logging import RichHandler\n\nlogger.root.handlers[0] = RichHandler(markup=True) # set rich handler\n
and try re-running the script. Hopefully you should see something beautiful in your terminal like this:
(Optional) We already briefly touched on logging during the module on config files using hydra. If you want to configure hydra to use a custom logging scheme like the one we set up in the last two exercises, you can take a look at this page. In hydra you will need to provide the configuration of the logger as a config file. You can find examples of such config files here.
When most people think of machine learning, they think about the training phase. Being able to track and log experiments is an important part of understanding what is going on with your model while you are training. It can help you debug your model and help tweak your models to perfection. Without proper logging of experiments, it can be really hard to iterate on the model because you do not know what changes led to an increase or decrease in performance.
The most basic logging we can do when running experiments is writing the metrics that our model is producing e.g. the loss or the accuracy to the terminal or a file for later inspection. We can then also use tools such as matplotlib for plotting the progression of our metrics over time. This kind of workflow may be enough when doing smaller experiments or working alone on a project, but there is no way around using a proper experiment tracker and visualizer when doing large scale experiments in collaboration with others. It especially becomes important when you want to compare performance between different runs.
There exist many tools for logging your experiments, with some of them being:
All of the frameworks offer many of the same functionalities; you can see a (biased) review here. We are going to use Weights and Biases (wandb), as it supports everything we need in this course. Additionally, it is an excellent tool for collaboration and sharing of results.
Using the Weights and Biases (wandb) dashboard we can quickly get an overview and compare many runs over different metrics. This allows for better iteration of models and the training procedure."},{"location":"s4_debugging_and_logging/logging/#exercises_1","title":"\u2754 Exercises","text":"Start by creating an account at wandb. I recommend using your github account but feel free to choose what you want. When you are logged in you should get an API key of length 40. Copy this for later use (HINT: if you forgot to copy the API key, you can find it under settings).
Next install wandb on your laptop
pip install wandb\n
Now connect to your wandb account
wandb login\n
you will be asked to provide the 40-character API key. The connection to the wandb server should remain open even when you close the terminal, such that you do not have to login each time. If using wandb
in a notebook you need to manually close the connection using wandb.finish()
.
With it all setup we are now ready to incorporate wandb
into our code. The interface is fairly simple, and this guide should give enough hints to get you through the exercise. (HINT: the two methods you need to call are wandb.init
and wandb.log
). To start with, logging the training loss of your model will be enough.
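A minimal sketch of what this could look like (the project name and the loss values are placeholders for your own code):
import random\nimport wandb\n\nwandb.init(project=\"my_mnist_project\") # placeholder project name\n\nfor epoch in range(10):\n loss = 1.0 / (epoch + 1) + 0.05 * random.random() # stand-in for your actual training loss\n wandb.log({\"train_loss\": loss, \"epoch\": epoch})\n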
After running your model, checkout the webpage. Hopefully you should be able to see at least one run with something logged.
Now log something other than scalar values. This could be an image, a histogram or a matplotlib figure. In all cases the logging is still going to use wandb.log
but you need extra calls to wandb.Image
etc. depending on what you choose to log.
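For example, logging an image and a histogram could look something like the sketch below (the arrays are random data just to illustrate the calls):
import numpy as np\nimport wandb\n\nwandb.init(project=\"my_mnist_project\") # placeholder project name\n\nimg = np.random.rand(28, 28) # stand-in for e.g. an input sample\nwandb.log({\"example_image\": wandb.Image(img)})\nwandb.log({\"weight_histogram\": wandb.Histogram(np.random.randn(1000))})\n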
Finally, lets create a report that you can share. Click the Create report button and include some of the graphs/plots/images that you have generated in the report.
To make sure that you have completed todays exercises, make the report shareable by clicking the Share button and create view-only-link. Send the link to my email nsde@dtu.dk
, so I can checkout your awesome work \ud83d\ude03
When calling wandb.init
you have two arguments called project
and entity
. Make sure that you understand these and try them out. It will come in handy for your group work as they essentially allow multiple users to upload their own runs to the same project in wandb
.
Wandb also comes with a built-in feature for doing hyperparameter sweeps which can be beneficial to get a better working model. Look through the documentation on how to do a hyperparameter sweep in Wandb. You at least need to create a new file called sweep.yaml
and make sure that you call wandb.log
in your code on an appropriate value. Note: if you want hydra
and wandb
to work together you will need to change the command
config in your sweep.yaml
file, see this page.
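The exercise asks for a sweep.yaml file, but to illustrate the structure of a sweep, here is a hedged sketch of the equivalent configuration expressed through the Python API (the project name and training function are toy placeholders):
import random\nimport wandb\n\ndef train() -> None:\n wandb.init()\n lr = wandb.config.lr # value chosen by the sweep controller\n wandb.log({\"train_loss\": random.random() / lr}) # stand-in for a real training run\n\nsweep_config = { # same structure you would put in sweep.yaml\n \"method\": \"random\",\n \"metric\": {\"name\": \"train_loss\", \"goal\": \"minimize\"},\n \"parameters\": {\"lr\": {\"min\": 0.0001, \"max\": 0.1}},\n}\nsweep_id = wandb.sweep(sweep_config, project=\"my_mnist_project\") # placeholder project name\nwandb.agent(sweep_id, function=train, count=5)\n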
In the future it will be important for us to be able to run Wandb inside a docker container (together with whatever training or inference we specify). The problem here is that we cannot authenticate Wandb in the same way as in the previous exercise; it needs to happen automatically. Let's therefore look into how we can do that.
First we need to generate an authentication key, or more precisely an API key. This is in general the way any service (like a docker container) can authenticate. Start by going to https://wandb.ai/home, click your profile icon in the upper right corner and then go to settings. Scroll down to the danger zone and generate a new API key and finally copy it.
Next create a new docker file called wandb.docker
and add the following code
FROM python:3.9\nRUN apt update && \\\n apt install --no-install-recommends -y build-essential gcc && \\\n apt clean && rm -rf /var/lib/apt/lists/*\nRUN pip install wandb\nCOPY s4_debugging_and_logging/exercise_files/wandb_tester.py wandb_tester.py\nENTRYPOINT [\"python\", \"-u\", \"wandb_tester.py\"]\n
please take a look at the script being copied into the image and afterwards build the docker image.
When we want to run the image, we need to include an environment variable that contains the API key we generated. This will then authenticate the docker container with the wandb server:
docker run -e WANDB_API_KEY=<your-api-key> wandb:latest\n
Try running it and confirm that the results are uploaded to the wandb server.
Feel free to experiment more with wandb
as it is a great tool for logging, organizing and sharing experiments.
That is the module on logging. Please note that at this point in the course you will begin to see some overlap between the different frameworks. While we mainly used hydra
for configuring our python scripts it can also be used to save metrics and hyperparameters similar to how wandb
can. A similar argument holds for dvc
which can also be used to log metrics. In our opinion wandb
just offers a better experience when interacting with the results after logging. We want to stress that the combination of tools presented in this course may not be the best for all your future projects, and we recommend finding a setup that fits you. That said, each framework provides specific features that the others do not.
Finally, we want to note that while we during the course really try to showcase a lot of open-source frameworks, Wandb is not one of them. It is free to use for personal usage (with a few restrictions) but for enterprise use it does require a license. If you are eager to only work with open-source tools we highly recommend trying out MLFlow which offers the same overall functionalities as Wandb.
"},{"location":"s4_debugging_and_logging/profiling/","title":"M12 - Profiling","text":""},{"location":"s4_debugging_and_logging/profiling/#profilers","title":"Profilers","text":"Core Module
"},{"location":"s4_debugging_and_logging/profiling/#profilers_1","title":"Profilers","text":"In general profiling code is about improving the performance of your code. In this session we are going to take a somewhat narrow approach to what \"performance\" is: runtime, meaning the time it takes to execute your program.
At the bare minimum, the two questions a proper profiling of your program should be able to answer are:
The first question is important for prioritizing optimization. If two methods A
and B
have approximately the same runtime, but A
is called 1000 times more often than B
, we should probably spend time optimizing A
over B
if we want to speed up our code. The second question almost answers itself, directly telling us which methods are the most expensive to call.
Using profilers can help you find bottlenecks in your code. In this exercise we will look at two different profilers, the first one being cProfile. cProfile
is Python's built-in profiler that can give you an overview of the runtime of all the functions and methods involved in your program.
Run the cProfile
on the vae_mnist_working.py
script. Hint: you can directly call the profiler on a script using the -m
arg
python -m cProfile -o <output_file> -s <sort_order> myscript.py\n
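If you saved the output to a file with the -o flag, one way to inspect it afterwards is with Python's built-in pstats module; a small sketch (the file name is just an example):
import pstats\n\nstats = pstats.Stats(\"profile_output.prof\") # whatever name you gave to <output_file>\nstats.sort_stats(\"cumtime\").print_stats(10) # show the 10 most expensive calls by cumulative time\n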
Try looking at the output of the profiling. Can you figure out which function took the longest to run?
Can you explain the difference between tottime
and cumtime
? Under what circumstances do these differ and when are they equal?
To get a better feeling of the profiled result we can try to visualize it. Python does not provide a native solution, but open-source solutions such as snakeviz exist. Try installing snakeviz
and load a profiled run into it (HINT: snakeviz expects the run to have the file format .prof
).
Try optimizing the run! (Hint: The data is not stored as torch tensor). After optimizing the code make sure (using cProfile
and snakeviz
) that the code actually runs faster.
Profiling machine learning code can become much more complex because we suddenly begin to mix different devices (CPU+GPU) that can (and should) overlap some of their computations. When profiling this kind of machine learning code we are often looking for bottlenecks. A bottleneck is simply the place in your code that is preventing other processes from performing their best. This is the reason that all major deep learning frameworks also include their own profilers that can help profile more complex applications.
The image below shows a typical report using the built-in profiler in PyTorch. As the image shows, the profiler looks both at the kernel
time (this is the time spent doing actual computations) and also transfer times such as memcpy
(where we are copying data between devices). It can even analyze your code and give recommendations.
Using the profiler can be as simple as wrapping the code that you want to profile with the torch.profiler.profile
context manager
with torch.profiler.profile(...) as prof:\n # code that I want to profile\n output = model(data)\n
"},{"location":"s4_debugging_and_logging/profiling/#exercises_1","title":"\u2754 Exercises","text":"Exercise files
In these exercises we investigate the profiler that is built into PyTorch. Note that these exercises require that you have PyTorch v1.8.1 (or higher) installed. You can always check which version you currently have installed by writing (in a python interpreter):
import torch\nprint(torch.__version__)\n
We always recommend updating to the latest PyTorch version for the best experience. Additionally, to display the result nicely (like snakeviz
for cProfile
) we are also going to use the tensorboard profiler extension
pip install torch_tb_profiler\n
A good starting point is to look at the API for the profiler. Here the important class to look at is the torch.profiler.profile
class.
Let's try out a simple example (taken from here):
Try to run the following code
import torch\nimport torchvision.models as models\nfrom torch.profiler import profile, ProfilerActivity\n\nmodel = models.resnet18()\ninputs = torch.randn(5, 3, 224, 224)\n\nwith profile(activities=[ProfilerActivity.CPU], record_shapes=True) as prof:\n model(inputs)\n
this will profile the forward
pass of a ResNet-18 model.
Running this code will produce a prof
object that contains all the relevant information about the profiling. Try writing the following code:
print(prof.key_averages().table(sort_by=\"cpu_time_total\", row_limit=10))\n
what operation is taking most of the cpu?
Try running
print(prof.key_averages(group_by_input_shape=True).table(sort_by=\"cpu_time_total\", row_limit=30))\n
can you see any correlation between the shape of the input and the cost of the operation?
(Optional) If you have a GPU you can also profile the operations on that device:
with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA], record_shapes=True) as prof:\n model(inputs)\n
(Optional) As an alternative to using profile
as a context manager we can also use its .start
and .stop
methods:
prof = profile(...)\nprof.start()\n... # code I want to profile\nprof.stop()\n
Try doing this on the above example.
The torch.profiler.profile
function takes some additional arguments. What argument would you need to set to also profile the memory usage? (Hint: this page) Try doing it to the simple example above and make sure to sort the sample by self_cpu_memory_usage
.
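If you get stuck, a sketch of what this could look like, building on the example above, is shown here (assuming a recent PyTorch version):
with profile(activities=[ProfilerActivity.CPU], profile_memory=True, record_shapes=True) as prof:\n model(inputs)\nprint(prof.key_averages().table(sort_by=\"self_cpu_memory_usage\", row_limit=10))\n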
As mentioned we can also get a graphical output for better inspection. After having done a profiling try to export the results with:
prof.export_chrome_trace(\"trace.json\")\n
you should be able to visualize the file by going to chrome://tracing
in any chromium based web browser. Can you still identify the information printed in the previous exercises from the visualizations?
Running profiling on a single forward step can produce misleading results as it only provides a single sample that may depend on what background processes are running on your computer. Therefore it is recommended to profile multiple iterations of your model. In that case we need to include prof.step()
to tell the profiler when we are doing a new iteration
with profile(...) as prof:\n for i in range(10):\n model(inputs)\n prof.step()\n
Try doing this. Is the conclusion the same regarding which operations take up most of the time? Has the percentage changed significantly?
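Related to this, torch.profiler also provides a schedule argument that controls which iterations are actually recorded; a sketch building on the example above could look like:
from torch.profiler import ProfilerActivity, profile, schedule\n\nmy_schedule = schedule(wait=1, warmup=1, active=3) # skip 1 step, warm up for 1 step, record 3 steps\nwith profile(activities=[ProfilerActivity.CPU], schedule=my_schedule) as prof:\n for i in range(10):\n model(inputs)\n prof.step()\n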
Additionally, we can also visualize the profiling results using the profiling viewer in tensorboard.
Start by initializing the profile
class with an additional argument:
from torch.profiler import profile, tensorboard_trace_handler\nwith profile(..., on_trace_ready=tensorboard_trace_handler(\"./log/resnet18\")) as prof:\n ...\n
Try run a profiling (using a couple of iterations) and make sure that a file with the .pt.trace.json
is produced in the log/resnet18
folder.
Now try launching tensorboard
tensorboard --logdir=./log\n
and open the page http://localhost:6006/#pytorch_profiler, where you should hopefully see an image similar to the one below:
Image credit
Try poking around in the interface.
Tensorboard has a nice feature for comparing runs under the diff
tab. Try redoing a profiling run but use model = models.resnet34()
instead. Load up both runs and try to look at the diff
between them.
As a final exercise, try to use the profiler on the vae_mnist_working.py
file from the previous module on debugging, where you profile a whole training run (not only the forward pass). What is the bottleneck during training? Is it still the forward pass or is it something else? Can you improve the code somehow based on the information from the profiler?
This ends the module on profiling. If you want to go into more details on this topic we can recommend looking into line_profiler and kernprof. A downside of using Python's cProfile
is that it can only profile at a function/module level, which is great for identifying hotspots in your code. However, sometimes the cause of a computational hotspot is a single line of code in a function, which will not be caught by cProfile
. An example would be a simple index operation such as a[idx] = b
, which for large arrays and non-sequential indexes is really expensive. For these cases line_profiler and kernprof are excellent tools to have in your toolbox. Additionally, if you do not like cProfile we can also recommend py-spy which is another open-source profiling tool for python programs.
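As a hedged sketch of how kernprof/line_profiler is typically used: you decorate the function you care about and run the script through kernprof, which injects the profile decorator at runtime (the function here is just a toy example):
# run with: kernprof -l -v my_script.py\n@profile # injected by kernprof at runtime, not a normal import\ndef slow_function(n: int) -> int:\n total = 0\n for i in range(n):\n total += i * i # line-by-line timings will show how expensive this line is\n return total\n\nif __name__ == \"__main__\":\n slow_function(1_000_000)\n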
Slides
Continuous integration is a sub-discipline of the general field of Continuous X. Continuous X is one of the core elements of modern DevOps, and by extension MLOps. Continuous X assumes that we have a (long) developer pipeline (see image below) where we want to make some changes to our code, e.g.:
Basically, any code change we make is expected to have an influence on the final result. The problem with making changes at the start of our pipeline is that we want the change to propagate all the way through to the end of the pipeline.
Image creditThis is where continuous X comes into play. The word continuous here refers to the fact that the pipeline should continuously be updated as we make code changes. You can also choose to think of this as the automatization of processes. The X then covers that the process we need to go through to automate steps in the pipeline depends on where we are in the pipeline e.g. the tools needed to do continuous integration are different from the tools needed to do continuous delivery.
In this session, we are going to focus on continuous integration (CI). As indicated in the image above, CI usually takes care of the first part of the developer pipeline which has to do with the code base, code building and code testing. This is a paramount step in automation as we would rather catch bugs at the beginning of our pipeline than at the end.
Learning objectives
The learning objectives of this session are:
The Github Actions we learned about in M16 are a powerful tool that can be used for much more than simply running the tests that we write for our application. In this module we are going to look at how we can use it for continuously building docker images. As you have already seen, docker builds can take a couple of minutes each time we make changes to our code base. For this reason we really just want a new image to be built automatically every time we commit our code. Thus, it should come as no surprise that we can also automate the building process and furthermore take advantage of online compute power to parallelize the process.
As discussed in the initial module on docker, docker hub is an online solution for storing built docker images in the cloud so that they are easy to pull down on whatever machine you want to run on. Docker hub is free to use for personal use, as long as the images you push are public. We are in this session going to look at how we can automatically build and push our docker builds to docker hub. In a future module we are going to look at the exact same process of building and pushing containers, but this time to a general cloud provider.
"},{"location":"s5_continuous_integration/auto_docker/#exercises","title":"\u2754 Exercises","text":"For these exercises you can choose to work with any docker file of your choosing. If you want an easy docker file, you can use the following:
FROM busybox\nCMD echo \"Howdy cowboy\"\n
Alternatively, you can choose to focus on automating the training and prediction docker files back from M9. You will most likely need to change the docker image for your applications if they contain any references to your data, e.g. you have a COPY data/ data/
statement in the file. Since we do not store our data in Github, we cannot copy it during the build process.
Start by pushing whatever docker file you want to be continuously built to your repository
Start by creating a Docker Hub account
Next, within Docker Hub create an access token by going to Settings -> Security
. Click the New Access Token
button and give it a name that you recognize.
Copy the newly created access token and head over to your Github repository online. Go to Settings -> Secrets -> Actions
and click the New repository secret
. Copy over the access token and give it the name DOCKER_HUB_TOKEN
. Additionally, add two other secrets DOCKER_HUB_USERNAME
and DOCKER_HUB_REPOSITORY
that contains your docker username and docker repository name respectively.
Next we are going to construct the actual Github actions workflow file:
name: Docker Image CI\n\non:\n push:\n branches: [ master ]\n\njobs:\n build:\n runs-on: ubuntu-latest\n steps:\n - uses: actions/checkout@v2\n - name: Build the Docker image\n run: |\n echo \"${{ secrets.DOCKER_HUB_TOKEN }}\" | docker login \\\n -u \"${{ secrets.DOCKER_HUB_USERNAME }}\" --password-stdin docker.io\n docker build . --file Dockerfile \\\n --tag docker.io/${{ secrets.DOCKER_HUB_USERNAME }}/${{ secrets.DOCKER_HUB_REPOSITORY }}:$GITHUB_SHA\n docker push docker.io/${{ secrets.DOCKER_HUB_USERNAME }}/${{ secrets.DOCKER_HUB_REPOSITORY }}:$GITHUB_SHA\n
The first part of the workflow file should look somewhat recognizable. However, the last three lines are where all the magic happens. Carefully go through them and figure out what they do. If you want some help you can look at the help pages for docker login
, docker build
and docker push
.
Upload the workflow to your github repository and check that it is being executed. If everything works you should be able to see the built docker image in your container repository in docker hub.
Make sure that you can execute docker pull
locally to pull down the image that you just continuously built
(Optional) To test that the container works directly in github you can also try to include an additional step that actually runs the container.
- name: Run container\n run: |\n docker run ...\n
That ends the session on continuous docker building. We are going to revisit this topic after introducing the basic concepts of working in the cloud, as it will make our life easier in the long run when we get to continuous deployment (CD) that our containers are stored in the same place where we are going to run them. For completeness it is worth mentioning that docker hub also offers the possibility of building your images in a continuous way, by specifying so-called build rules.
"},{"location":"s5_continuous_integration/cml/","title":"M19 - Continuous Machine Learning","text":""},{"location":"s5_continuous_integration/cml/#continuous-machine-learning","title":"Continuous Machine Learning","text":"The continuous integration we have looked at until now is what we can consider \"classical\" continuous integration, that have its roots in DevOps and not MLOps. While the test that we have written and the containers ww have developed in the previous session have be around machine learning, everything we have done translate to completely to how it would be done if we had developed any other application did not include machine learning.
In this session, we are now going to change gear and look at continuous machine learning (CML). As the name may suggest, we are now focusing on automating actual machine learning processes. You may ask why we need continuous integration principles baked into machine learning pipelines? The reason is the same as with any continuous integration, namely that we have a bunch of checks that we want our newly trained model to pass before we trust it. Writing unittests
ensures that our code is not broken, but there are other failure modes of a machine learning pipeline that should be checked before the model is ready for deployment:
Answering these questions in a continuous way is possible through continuous machine learning. For this session, we are going to use cml
by iterative.ai. Strictly speaking, using the cml
framework is not a necessary component for doing continuous machine learning, but it is a streamlined way of doing this and offers tools to easily get a report about how a specific run performed. If we were just interested in triggering model training every time we do a git push
we essentially just need to include
run: python train.py\n
to any of our workflow files.
The figure below describes the overall process using the cml
framework. It should be clear that it is the very same process that we go through as in the other continuous integration sessions: push code
-> trigger github actions
-> do stuff
. The new part in this session is that we want a report of the findings of the automated run to appear after the run is done.
We are first going to revisit our train.py
script. If we want cml
to automatically be able to report the performance of our trained model to us after it is trained, we need to give it some statistics to work with. Below is some pseudo-code that computes the accuracy and the confusion matrix of our trained model. Create a copy of your training script (call it train_cml.py
) and make sure your script is also producing a classification report and confusion matrix as in the pseudo-code.
# assume we have a trained model\nimport matplotlib.pyplot as plt\nimport torch\nfrom sklearn.metrics import classification_report, confusion_matrix, ConfusionMatrixDisplay\npreds, target = [], []\nfor batch in train_dataloader:\n x, y = batch\n probs = model(x)\n preds.append(probs.argmax(dim=-1).detach())\n target.append(y.detach())\n\ntarget = torch.cat(target, dim=0)\npreds = torch.cat(preds, dim=0)\n\nreport = classification_report(target, preds)\nwith open(\"classification_report.txt\", 'w') as outfile:\n outfile.write(report)\nconfmat = confusion_matrix(target, preds)\ndisp = ConfusionMatrixDisplay(confusion_matrix=confmat)\ndisp.plot()\nplt.savefig('confusion_matrix.png')\n
Similar to what we have looked at until now, automation happens using github workflow files. The main difference from the continuous integration we have looked at until now is that we are actually going to train our model whenever we do a git push
. Copy the following code into a new workflow (called cml.yaml
) and add that file to the folder where you keep your workflow files.
name: train-my-model\non: [push]\njobs:\n run:\n runs-on: [ubuntu-latest]\n steps:\n - uses: actions/checkout@v2\n - uses: iterative/setup-cml@v1\n - name: Train model\n run: |\n pip install -r requirements.txt # install dependencies\n python train.py # run training\n - name: Write report\n env:\n # this authenticates that the right permissions are in place\n REPO_TOKEN: ${{ secrets.GITHUB_TOKEN }}\n run: |\n # send all information to report.md that will be reported to us when the workflow finish\n cat classification_report.txt >> report.md\n cml-publish confusion_matrix.png --md >> report.md\n cml-send-comment report.md\n
Nearly everything in the workflow file should look familiar, except the last two lines.
Try pushing the workflow file to your github repository and make sure that it completes. If it does not, you may need to adjust the workflow file slightly.
Send yourself a pull-request. I recommend seeing this very short video on how to send yourself a pull-request with a small change. If your workflow file is executed correctly you should see github-actions
commenting with a performance report on your PR.
(Optional) cml
is offered by the same people behind dvc
and it should therefore come as no surprise that these features can interact with each other. If you want to deep dive into this, here is a great starting point.
This ends the session on continuous machine learning. If you have not already noticed, one limitation of using github actions is that their default runners e.g. runs-on: [ubuntu-latest]
are only CPU machines (see hardware config). As we all know, modern machine learning more or less requires hardware acceleration (=GPUs) to train within reasonable time. Luckily for us, cml
also integrates with large cloud providers and we therefore recommend that after going through the modules on cloud computing you return to this exercise and experiment with setting up self-hosted runners.
Core Module
With the tests established in the previous module we are now ready to move on to actually implementing some continuous integration in our pipeline. As you probably have already realized, testing your code locally can be cumbersome to do, because
For these reasons we want to automate the testing, such that it is done every time we push to our repository. If we combine this with only pushing to branches and only merging these branches whenever all automated tests have passed, our code should be fairly safe against unwanted bugs (assuming your tests cover your code well).
"},{"location":"s5_continuous_integration/github_actions/#github-actions_1","title":"Github actions","text":"Github actions are the CI solution that Github provides. Each of your repositories gets 2,000 minutes of free testing per month which should be more than enough for the scope of this course (and probably all personal projects you do). Getting Github actions setup in a repository may seem complicated at first, but workflow files that you create for one repository can more or less be reused for any other repository that you have.
Lets take a look at how a github workflow file is organized:
name
runs-on
we can specify which operating system we want the workflow to run on. We also have the possibility to specify multiple.steps
. This is where we specify the actual commands that should be run when the workflow is executed.Start by creating a .github
folder in the root of your repository. Add a sub-folder to that called workflows
.
Go over this page that explains how to do automated testing of python code in github actions. You do not have to understand everything, but at least get a feeling of what a workflow file should look like.
We have provided a workflow file called tests.yml
that should run your tests for you. Place this file in the .github/workflows/
folder. The workflow file consist of three steps
First a python environment is setup (in this case python 3.8)
Next all dependencies required to run the test are installed
Finally, pytest
is called and test will be run
For the script to work you need to define the requirements.txt
and requirements_tests.txt
. The first file should contain all packages required to run your code. The second file is all additional packages required to run the tests. In your simple case it may very well be that the second file is empty, however sometimes additional packages are used for testing that are not strictly required for the scripts to run.
Finally, try pushing the changes to your repository. Hopefully your tests should just start, and you will after sometime see a green check mark next to hash of the commit. Also try to checkout the Actions tap where you can see the history of actions run.
Normally we develop code one operating system and just hope that it will work on other operating systems. However, CI enables us to automatically test on other systems than ourself.
The provided tests.yml
only runs on one operating system. Which one?
Alter the file (or write a new) that executes the test on the two other main operating systems that exist.
As the workflow is currently setup, github actions will destroy every downloaded package when the workflow has been executed. To improve this we can take advantage of caching
:
Figure out how to implement caching
in your workflow file. You can find a guide here and here.
When you have implemented a caching system go to Actions->Caches
in your repository and make sure that they are correctly added. It should look something like the image below
Measure how long your workflow takes before and after adding caching
to your workflow. Did it improve the runtime of your workflow?
(Optional) Code coverage can also be added to the workflow file by uploading it as an artifact after running the coverage. Follow the instructions in this post on how to do it.
As stated in the introduction, ideally we want to only push our code to branches, such that our workflows run before we actually merge code into our codebase. We can directly prevent bad behavior by adding branch protection rules to our repository. Take the image below as an example from one of my own PRs:
In this example, the PR cannot be merge to the main branch before the following is fulfilled: At least 2 reviewers with write access have approved the PR, all Github actions marked as Required are passing and all conversations needs to be resolved. Since not all important tests are passing, further changes are necessary. We want to implement something similar. Do the following:
On your Github repository of choice, go to Settings -> Branches -> Add branch protection rule
:
To your main/master branch add the following rules:
To test that everything works, try creating a PR (possibly with a small bug) and see that your main/master branch is protected
One problem you may have encountered is running your tests that have to do with your data, with the core problem being that your data is actually not stored in github (assuming you have done module M8 - DVC) and therefore cannot be tested. However, it is possible for us to download data while running our CI. Lets try to setup that:
The first problem is that we need our CI needs to be able to authenticate with the our storage solution. We can take advantage of an authentication file that is created the first time we push with DVC. It is located in $CACHE_HOME/pydrive2fs/{gdrive_client_id}/default.json
where $CACHE_HOME
depends on your operating system:
~/Library/Caches
~/.cache
This is the typical location, but it may vary depending on what distro you are running
{user}/AppData/Local
Find the file. The content should look similar to this (only some fields are shown):
{\n \"access_token\": ...,\n \"client_id\": ...,\n \"client_secret\": ...,\n \"refresh_token\": ...,\n ...\n}\n
The content of that file is should be treated as an password an not shared with the world and the relevant question is therefore how to use this info in public repository. The answer is github secrets, where we can store information, access it in our workflow files and it is still not public. Navigate to the secrets option (as shown below) and create a secret with the name GDRIVE_CREDENTIALS_DATA
that contains the content of the file you found in the previous exercise.
Afterwards, add the following code to your workflow file:
- uses: iterative/setup-dvc@v1\n- name: Get data\n run: dvc pull\n env:\n GDRIVE_CREDENTIALS_DATA: ${{ secrets.GDRIVE_CREDENTIALS_DATA }}\n
that runs dvc pull
using the secret authentication file. For help you can visit this small repository that implements the same workflow.
Finally, add the changes, commit, push and confirm that everything works as expected. You should now be able to run unit tests that depends on your input data.
In module M6 on good coding practices (optional module) of the course you where introduced to a couple of good coding practices such as being consistent with your coding style, how your Python packages are sorted and that your code follows certain standards. All this was done using the ruff
framework. In this set of exercises we will setup github workflows that will automatically test for this.
Create a new workflow file called codecheck.yml
, that implements the following three steps
Setup python environment
Installs ruff
Runs ruff check
and ruff format
on the repository
(HINT: You should be able to just change the last steps of the tests.yml
workflow file)
In addition to ruff
we also used mypy
in those set of exercies for checking if the typing we added to our code was good enough. Add another step to the codecheck.yml
file which runs mypy
on your repository.
Try to make sure that all steps are passing on repository. Especially mypy
can be hard to get passing, so this exercise formally only requires you to get ruff
passing.
When working with Github actions you will often encounter the following 4 concepts:
Try to define them with your own words.
Solutionyaml
file that defines the instructions to execute on specific events. Needs to be placed in the .github/workflows
folder.The on
attribute specify upon which events the workflow will be triggered. Assume you have set the on
attribute to the following:
on:\n push:\n branches: [main]\n pull_request:\n branches: [main]\n schedule:\n - cron: \"0 0 * * *\"\n workflow_dispatch: {}\n
What 4 events would trigger the execution of that action?
Solutionmain
would trigger itmain
would trigger itThe trigger can be executed by manually triggering it through the Github UI, example shown below
This ends the module on Github workflows. If you are more interested in this topic you can checkout module M31 on documentation which first including locally building some documentation for your project and afterwards use Github actions for deploying it to Github Pages. Additionally, Github also have a lot of templates already for running a lot CI tasks. If you try to create a workflow file directly in Github you may encounter the following page
We highly recommend checking this out if you want to write any other kind of CI pipeline in Github actions. We can also recommend this repository that have an list of awesome actions and checkout the act repository which is a tool for running your GitHub Actions locally!
"},{"location":"s5_continuous_integration/pre_commit/","title":"M17 - Pre commit","text":""},{"location":"s5_continuous_integration/pre_commit/#pre-commit","title":"Pre-commit","text":"One of the cornerstones of working with git is remembering to commit your work often. Often committing makes sure that it is easier to identify and revert unwanted changes that you have introduced, because the code changes becomes smaller per commit.
However, as you hopefully already seen in the course there are a lot of mental task to do before you actually write git commit
in the terminal. The most basic thing is of course making sure that you have saved all your changes, and you are not committing a not up-to-date file. However, this also includes tasks such as styling, formatting, making sure all tests succeeds etc. All these mental to-do notes does not mix well with the principal of remembering to commit often, because you in principal have to do them every time.
The obvious solution to this problem is to automate all or some of our mental task every time that we do a commit. This is where pre-commit hooks comes into play, as they can help us attach additional tasks that should be run every time that we do a git commit
.
Pre-commit simply works by inserting whatever workflow we want to automate in between whenever we do a git commit
and afterwards would do a git push
.
The system works by looking for a file called .pre-commit-config.yaml
that we can configure. If we execute
pre-commit sample-config | out-file .pre-commit-config.yaml -encoding utf8\n
you should get a sample file that looks like
# See https://pre-commit.com for more information\n# See https://pre-commit.com/hooks.html for more hooks\nrepos:\n- repo: https://github.com/pre-commit/pre-commit-hooks\n rev: v3.2.0\n hooks:\n - id: trailing-whitespace\n - id: end-of-file-fixer\n - id: check-yaml\n - id: check-added-large-files\n
the file structure is very simple:
id
of the different hooks. The id
corresponds to an id
in this file: https://github.com/pre-commit/pre-commit-hooks/blob/master/.pre-commit-hooks.yamlWhen we are done defining our .pre-commit-config.yaml
we just need to install it
pre-commit install\n
this will make sure that the file is automatically executed whenever we run git commit
Install pre-commit
pip install pre-commit\n
Next create the sample file
pre-commit sample-config > .pre-commit-config.yaml\n
The sample file already contains 4 hooks. Make sure you understand what each do and if you need them at all.
pre-commit
works by hooking into the git commit
command, running whenever that command is run. For this to work, we need to install the hooks into git commit
. Run
pre-commit install\n
to do this.
Try to commit your recently created .pre-commit-config.yaml
file. You will likely not do anything, because pre-commit
only check files that are being committed. Instead try to run
pre-commit run --all-files\n
that will check every file in your repository.
Try adding at least another check from the base repository to your .pre-commit-config.yaml
file.
If you have completed the optional module M7 on good coding practice you will have learned about the linter ruff
. ruff
comes with its own pre-commit hook. Try adding that to your .pre-commit-config.yaml
file and see what happens when you try to commit files.
(Optional) Add more hooks to your .pre-commit-config.yaml
.
Sometimes you are in a hurry, so make sure that you also can do commits without running pre-commit
e.g.
git commit -m <message> --no-verify\n
Finally, figure out how to disable pre-commit
again (if you get tired of it).
That was all about how pre-commit
can be used to automate tasks. If you want to deep dive more into the topic you can checkout this page on how to define your own pre-commit
hooks.
Core Module
What often comes to mind for many developers, when discussing continuous integration (CI) is code testing. CI should secure that whenever a codebase is updated it is automatically tested such that if bugs have been introduced in the codebase it will be caught early on. If you look at the MLOps cycle, CI is one of the cornerstones of the operations part. However, it should be noted that applying CI does not magically secure that your code does not break. CI is only as strong as the tests that are automatically executed. CI simply structures and automates this.
Quote
Continuous Integration doesn\u2019t get rid of bugs, but it does make them dramatically easier to find and remove. Martin Fowler, Chief Scientist, ThoughtWorks
Image creditThe kind of tests we are going to look at are called unit testing. Unit testing refers to the practice of writing test that tests individual parts of your code base to test for correctness. By unit, you can therefore think of a function, module or in general any object. By writing tests in this way it should be very easy to isolate which part of the code broke after an update to the code base. Another way to test your code base would be through integration testing which is equally important but we are not going to focus on it in this course.
Unit tests (and integration tests) are not a unique concept to MLOps but are a core concept of DevOps. However, it is important to note that testing machine learning-based systems is much more difficult than traditional systems. The reason for this is that machine learning systems depend on data, that influences the state of our system. For this reason, we not only need unit tests and integration tests of our code we also need data testing, infrastructure testing and more monitoring to check that we stay within the data distribution we are training on (more on this in module M25 on data drifting). This added complexity is illustrated in the figure below.
"},{"location":"s5_continuous_integration/unittesting/#pytest","title":"Pytest","text":"Before we can begin to automate testing of our code base we of course need to write the tests first. It is both a hard and tedious task to do but arguably the most important aspect of CI. Python offers a couple of different libraries for writing tests. We are going to use pytest
.
The following exercises should be applied to your MNIST repository
The first part of doing CI is writing the unit tests. We do not expect you to cover every part of the code you have developed but try to at least write tests that cover two files. Start by creating a tests
folder.
Read the getting started guide for pytest which is the testing framework that we are going to use
Install pytest:
pip install pytest\n
Write some tests. Below are some guidelines on some tests that should be implemented, but you are of course free to implement more tests. You can at any point check if your tests are passing by typing in a terminal
pytest tests/\n
When you implement a test you need to follow two standards, for pytest
to be able to find your tests. First any files created (except __init__.py
) should always start with test_*.py
. Secondly, any test implemented needs to be wrapped into its own function that again needs to start with test_
:
# this will be found and executed by pytest\ndef test_something():\n ...\n\n# this will not be found and executed by pytest\ndef something_to_test():\n ...\n
Start by creating a tests/__init__.py
file and fill in the following:
import os\n_TEST_ROOT = os.path.dirname(__file__) # root of test folder\n_PROJECT_ROOT = os.path.dirname(_TEST_ROOT) # root of project\n_PATH_DATA = os.path.join(_PROJECT_ROOT, \"Data\") # root of data\n
these can help you refer to your data files during testing. For example, in another test file, I could write
from tests import _PATH_DATA\n
which then contains the root path to my data.
Data testing: In a file called tests/test_data.py
implement at least a test that checks that data gets correctly loaded. By this, we mean that you should check
def test_data():\n dataset = MNIST(...)\n assert len(dataset) == N_train for training and N_test for test\n assert that each datapoint has shape [1,28,28] or [784] depending on how you choose to format\n assert that all labels are represented\n
where N_train
should be either 30.000 or 50.000 depending on if you are just the first subset of the corrupted MNIST data or also including the second subset. N_test
should be 5000.
Model testing: In a file called tests/test_model.py
implement at least a test that checks for a given input with shape X that the output of the model has shape Y.
Training testing: In a file called tests/test_training.py
implement at least one test that asserts something about your training script. You are here given free hands on what should be tested but try to test something that risks being broken when developing the code.
Good code raises errors and gives out warnings in appropriate places. This is often in the case of some invalid combination of input to your script. For example, your model could check for the size of the input given to it (see code below) to make sure it corresponds to what you are expecting. Not implementing such errors would still result in Pytorch failing at a later point due to shape errors, however, these custom errors will probably make more sense to the end user. Implement at least one raised error or warning somewhere in your code and use either pytest.raises
or pytest.warns
to check that they are correctly raised/warned. As inspiration, the following implements ValueError
in code belonging to the model:
# src/models/model.py\ndef forward(self, x: Tensor):\n if x.ndim != 4:\n raise ValueError('Expected input to a 4D tensor')\n if x.shape[1] != 1 or x.shape[2] != 28 or x.shape[3] != 28:\n raise ValueError('Expected each sample to have shape [1, 28, 28]')\n
which would be captured by a test looking something like this:
# tests/test_model.py\ndef test_error_on_wrong_shape():\n with pytest.raises(ValueError, match='Expected input to a 4D tensor')\n model(torch.randn(1,2,3))\n
A test is only as good as the error message it gives, and by default, assert
will only report that the check failed. However, we can help ourselves and others by adding strings after assert
like
assert len(train_dataset) == N_train, \"Dataset did not have the correct number of samples\"\n
Add such comments to the assert statements you just did.
The tests that involve checking anything that has to do with our data, will of course fail if the data is not present. To future-proof our code, we can take advantage of the pytest.mark.skipif
decorator. Use this decorator to skip your data tests if the corresponding data files does not exist. It should look something like this
import os.path\n@pytest.mark.skipif(not os.path.exists(file_path), reason=\"Data files not found\")\ndef test_something_about_data():\n ...\n
You can read more about skipping tests here
After writing the different tests, make sure that they are passing locally.
We often want to check a function/module for various input arguments. In this case, you could write the same test over and over again for the different input, but pytest
also has built-in support for this with the use of the pytest.mark.parametrize decorator. Implement a parametrized test and make sure that it runs for different inputs.
There is no way of measuring how good the test you have written is. However, what we can measure is the code coverage. Code coverage refers to the percentage of your codebase that actually gets run when all your tests are executed. Having a high coverage at least means that all your code will run when executed.
Install coverage
pip install coverage\n
Instead of running your tests directly with pytest
, now do
coverage run -m pytest tests/\n
To get a simple coverage report simply type
coverage report\n
which will give you the percentage of cover in each of your files. You can also write
coverage report -m\n
to get the exact lines that were missed by your tests.
Finally, try to increase the coverage by writing a new test that runs some of the lines in your codebase that are not covered yet.
Often coverage
reports the code coverage on files that we actually do not want to get a code coverage for. Figure out how to configure coverage
to exclude some files.
Assuming you have a code coverage of 100%, would you expect that no bugs are present in your code?
SolutionNo, code coverage is not a guarantee that your code is bug-free. It is just a measure of how many lines of code are run when your tests are executed. Therefore, there may still be some corner case that is not covered by your tests and will result in a bug. However, having a high code coverage is a good indicator that you have tested your code.
Consider the following code:
@pytest.mark.parametrize(\"network_size\", [10, 100, 1000])\n@pytest.mark.parametrize(\"device\", [\"cpu\", \"cuda\"])\nclass MyTestClass:\n @pytest.mark.parametrize(\"network_type\", [\"alexnet\", \"squeezenet\", \"vgg\", \"resnet\"])\n @pytest.mark.parametrize(\"precision\", [torch.half, torch.float, torch.double])\n def test_network1(self, network_size, device, network_type, precision):\n if device == \"cuda\" and not torch.cuda.is_available():\n pytest.skip(\"Test requires cuda\")\n model = MyModelClass(network_size, network_type).to(device=device, dtype=precision)\n ...\n\n @pytest.mark.parametrize(\"add_dropout\", [True, False])\n def test_network2(self, network_size, device, add_dropout):\n if device == \"cuda\" and not torch.cuda.is_available():\n pytest.skip(\"Test requires cuda\")\n model = MyModelClass2(network_size, add_dropout).to(device)\n ...\n
how many tests are executed when running the above code?
SolutionThe answer depends on whether or not we are running on a GPU-enabled machine. The test_network1
has 4 parameters, network_size, device, network_type, precision
, that respectively can take on 3, 2, 4, 3
values meaning that in total that test will be running 3x2x4x3=72
times with different parameters on a GPU-enabled machine and 36 on a machine without a GPU. A similar calculation can be done for test_network2
, which only has three factors network_size, device, add_dropout
that result in 3x2x2=12
test on a GPU-enabled machine and 6 on a machine without a GPU. In total, that means 84 tests would run on a machine with a GPU and 42 on a machine without a GPU.
That covers the basics of writing unit tests for Python code. We want to note that pytest
of course is not the only framework for doing this. Python has a built-in framework called unittest for doing this also (but pytest
offers a bit more features). Another open-source framework that you could choose to check out is hypothesis which can help catch errors in corner cases of your code. In addition to writing unit tests it is also highly recommended to test code that you include in your docstring belonging to your functions and modulus to make sure that any code there is in your documentation is also correct. For such testing, we can highly recommend using Python built-in framework doctest.
Slides
Running computations locally is often sufficient when only playing around with code in initial phase of development. However, to really scale your experiments you will need more computing power than what your standard laptop/desktop can offer. You probably already have experience with running on a local cluster or similar but todays topic is about utilizing cloud computing.
Image creditThere exist a numerous amount of cloud compute providers with some of the biggest being:
The all have slight advantages and disadvantages over each others. In this course we are going to focus on Google cloud, because they have been kindly enough to sponsor $50 of cloud credit to each student. If you happen to run out of credit, you can also get some free credit for a limited amount of time when you signup with a new account. What's important to note is that all these different cloud providers all have the same set of services, and that learning how to use the services of one cloud provider in many cases translate to also know how to use the same services at another cloud provider. The services are called something different and can have a bit of a different interface/interaction pattern but in the end it does not really matter.
Todays exercises are about getting to know how to work with the cloud. If you are in doubt about anything or want to deep dive into some topics, I can recommend watching this series of videos or going through the general docs.
Learning objectives
The learning objectives of this session are:
Core Module
Google Cloud Platform (GCP) is the cloud service provided by Google. The key concept, or selling point, of any cloud provider is the idea of near-infinite resources. Without the cloud it simply is not feasible to do many modern deep learning and machine learning tasks because they cannot be scaled locally.
The image below shows a subset of all the different services that the Google cloud platform offers. The ones marked in red are the ones we are actually going to investigate in this course. Therefore, if you get done with exercises early I highly recommend that you deep dive more into the Google cloud platform.
Image credit"},{"location":"s6_the_cloud/cloud_setup/#exercises","title":"\u2754 Exercises","text":"As the first step we are going to get you set up with some Google cloud credits.
Go to https://learn.inside.dtu.dk. Go to this course. Find the recent message where there should be a download link and instructions on how to claim the $50 cloud credit. Please do not share the link anywhere as there is a limited number of coupons. If you are not officially taking this course at DTU, Google gives $300 in cloud credit whenever you sign up with a new account. NOTE that you need to provide a credit card for this, so make sure to closely monitor your credit use so you do not end up spending more than the free credit.
Log in to the homepage of gcp. It should look like this:
Go to billing and make sure that your account is showing $50 of cloud credit
make sure to also check out the Reports
throughout the course. When you are starting to use some of the cloud services these tabs will update with info about how much time you can use before your cloud credit runs out. Make sure that you monitor this page as you will not be given another coupon.
One way to stay organized within GCP is to create projects.
Create a new project called dtumlops
. When you click create
you should get a notification that the project is being created. The notification bell is a good way to keep track of how the processes you are running are doing throughout the course.
For setup we are going to install gcloud
. gcloud
is the command line interface for working with our Google cloud account. Nearly everything that we can do through the web interface we can also do through the gcloud
interface. Follow the installation instructions here for your specific OS.
After installation, try in a terminal to type:
gcloud -h\n
the command should run and show the help page. If not, something went wrong in the installation (you may need to restart after installing).
Now log in by typing
gcloud auth login\n
you should be sent to a web page where you link your cloud account to the gcloud
interface. Afterwards, also run this command:
gcloud auth application-default login\n
If you at some point want to revoke this you can type:
gcloud auth revoke\n
Next you will need to set the project that we just created. In your web browser under project info, you should be able to see the Project ID
belonging to your dtumlops
project. Copy this and type the following command in a terminal
gcloud config set project <project-id>\n
You can also get the project info by running
gcloud projects list\n
Next install the Google cloud python API:
pip install --upgrade google-api-python-client\n
Make sure that the python interface is also installed. In a python terminal type
import googleapiclient\n
this should work without any errors.
(Optional) If you are using VSCode you can also download the relevant extension called Cloud Code
. After installing it you should see a small Cloud Code
button in the action bar.
Finally, we need to activate a couple of developer APIs that are not activated by default. In a terminal write
gcloud services enable apigateway.googleapis.com\ngcloud services enable servicemanagement.googleapis.com\ngcloud services enable servicecontrol.googleapis.com\n
you can always check which services are enabled by typing
gcloud services list\n
After following these steps your laptop should hopefully be set up for using gcp
locally. You are now ready to use their services, both locally on your laptop and in the cloud console.
A big part of using the cloud in a bigger organisation has to do with admin and quotas. Admin here in general refers to the different roles that users of GCP can have, while quotas refer to the amount of resources that a given user has access to. For example, one employee, let's say a data scientist, may only be granted access to certain GCP services that have to do with the development and training of machine learning models, with X
amounts of GPUs available, to make sure that the employee does not spend too much money. Another employee, a DevOps engineer, probably does not need access to the same services and not necessarily the same resources.
In this course we are not going to focus too much on this aspect but it is important to know that it exists. One feature you are going to need for doing the project is how to share a project with other people. This is done through the IAM (Identities and Access Management) page. Simply click the Grant Access
button, search for the email of the person you want to share the project with and give them either Viewer
, Editor
or Owner
access, depending on what you want them to be able to do. The figure below shows how to do this.
What we are going to go through right now is how to increase the quota for how many GPUs you have available in your project. By default, for free accounts in GCP (or accounts using teaching credits), the quota for GPUs that you can use is either 0 or 1 (their policies change from time to time). We will try to increase it in the exercises below.
"},{"location":"s6_the_cloud/cloud_setup/#exercises_1","title":"\u2754 Exercises","text":"Start by enabling the Compute Engine
service. Simply search for it in the top search bar. It should bring you to a page where you can enable the service (this may take some time). We are going to look more into this service in the next module.
Next go to the IAM & Admin
page, again search for it in the top search bar. The remaining steps are illustrated in the figure below.
Go to the quotas page
In the search field search for GPUs (all regions)
(needs to match exactly, the search field is case sensitive), such that you get the same quota as in the image.
In the limit column you can see what your current quota for the number of GPUs you can use is. Additionally, to the right of the limit you can see the current usage. It is worth checking this if you are ever in doubt whether a job is actually running on GPU or not.
Click the quota and afterwards the Edit quotas
button.
In the pop-up window, increase your limit to either 1 or 2.
After sending your request you can try clicking the Increase requests
tab to see the status of your request
If you ever run into errors when working with GPUs that contain statements about quotas
you can always go to this page, see what you are currently allowed to use and try to increase it. For example, when you get to training machine learning models using Vertex AI in the next module, you will most likely need to ask for a quota increase for that service as well.
Finally, we want to note that a quota increase is sometimes not allowed within 24 hours of creating an account. If your request gets rejected, we recommend waiting a day and trying again. If this still does not work, you may need to use their services some more to make sure you are not a bot that wants to mine crypto.
"},{"location":"s6_the_cloud/cloud_setup/#knowledge-check","title":"\ud83e\udde0 Knowledge check","text":"What considerations to take when choosing an GCP region for running a new application?
SolutionA series of factors may influence your choice of region, including:
The 3 major cloud providers all have the same services, but they are called something different depending on the provider. What are the corresponding names of these GCP services in AWS and Azure?
It is important to know these correspondences to navigate blog posts etc. about MLOps on the internet.
Solution
GCP | AWS | Azure
Compute Engine | Elastic Compute Cloud (EC2) | Virtual Machines
Cloud Storage | Simple Storage Service (S3) | Blob Storage
Cloud Functions | Lambda Functions | Serverless Compute
Cloud Run | App Runner, Fargate, Lambda | Container Apps, Container Instances
Cloud Build | CodeBuild | DevOps
Vertex AI | SageMaker | AI Platform
Core Module
In this set of exercises we are going to get more familiar with using some of the resources that the Google Cloud Platform offers.
"},{"location":"s6_the_cloud/using_the_cloud/#compute","title":"Compute","text":"The most basic service of any cloud provider is the ability to create and run virtual machines. In gcp
this service is called the Compute Engine API. A virtual machine essentially allows you to run an operating system that behaves like a completely separate computer. There are many reasons why one would want to use virtual machines:
Virtual machines allow you to scale your operations, essentially giving you access to infinitely many individual computers
Virtual machines allow you to use large scale hardware. For example, if you are developing a deep learning model on your laptop and want to know the inference time for a specific hardware configuration, you can just create a virtual machine with those specs and run your model.
Virtual machines allow you to run processes in the \"background\". If you want to train a model for a week or more, you do not want to do this on your own laptop as you cannot really move it or do anything with it while it is training. Virtual machines allow you to just launch a job and forget about it (at least until you run out of credit).
We are now going to start actually using the cloud.
Click on the Compute Engine
tab in the sidebar on the homepage of gcp
.
Try to Create instance
. You will see the following image below.
Give it a meaningful name and set the location to one that is close to where you actually are (to reduce latency). Finally, try to adjust the configuration a bit. What two factors are affecting the price of the compute unit?
After figuring this out, create an e2-medium
instance (leave the rest configured as default). Before clicking the Create
button, make sure to check the Equivalent Command Line
button. You should see a very long command that you could have typed instead to do the exact same thing.
Now in a local terminal type:
gcloud compute instances list\n
you should hopefully see the instance you have just created.
You can start a terminal directly by typing:
gcloud beta compute ssh --zone <zone> <name> --project <project-id>\n
You can always see the exact command that you need to run to ssh
to a VM by selecting the View gcloud command
option in the Compute Engine overview (see image below).
While logged into the instance, check if Python and PyTorch are installed. You should see that neither is installed. For the VM we have only specified what compute resources it should have, not what software should be on it. We can fix this by starting VMs based on specific docker images (it's all coming together).
gcp
comes with a number of ready-to-go images for doing deep learning. More info can be found here. Try running this line:
gcloud container images list --repository=\"gcr.io/deeplearning-platform-release\"\n
what does the output show?
Next, start (in the terminal) a new instance using a Pytorch image. The command for doing it should look something like this:
gcloud compute instances create <instance_name> \\\n    --zone=<zone> \\\n    --image-family=<image-family> \\\n    --image-project=deeplearning-platform-release \\\n    # add these arguments if you want to run on GPU\n    --accelerator=\"type=nvidia-tesla-k80,count=1\" \\\n    --maintenance-policy TERMINATE \\\n    --metadata=\"install-nvidia-driver=True\"\n
You can find more info here on what <image-family>
should have as its value and what extra arguments you need to add if you want to run on GPU (if you have access).
ssh
to the VM as in one of the previous exercises. Confirm that the image indeed contains both a Python installation and that PyTorch is installed. Hint: you also have the possibility through the web page to start a browser session directly to the VMs you create:
Finally, everything that you have done locally can also be achieved through the web terminal, which of course comes pre-installed with the gcloud
command etc.
Try out launching this and run some of the commands from the previous exercises.
Stopping VMs
If you are not careful you can end up wasting a lot of credits on virtual machines that you are not using. VMs are charged by the minute as long as they are running, so even if you are not actively using them you are still paying for them. Therefore, it is important that you remember to stop your VMs when you are not using them. You can do this by either clicking the Stop
button in the VM overview page or by running the following command:
gcloud compute instances stop <instance-name>\n
"},{"location":"s6_the_cloud/using_the_cloud/#data-storage","title":"Data storage","text":"Another big part of cloud computing is storage of data. There are many reason that you want to store your data in the cloud including:
Cloud storage is luckily also very cheap. Google cloud only charges around $0.026 per GB per month. This means that around 1 TB of data would cost you $26, which is more than what the same amount of storage would cost on Google Drive, but cloud storage is much more focused on enterprise use cases where you need to access data through an API.
"},{"location":"s6_the_cloud/using_the_cloud/#exercises_1","title":"\u2754 Exercises","text":"When we did the exercise on data version control, we made dvc
work together with our own Google Drive to store data. However, a big limitation of this is that we need to authenticate each time we try to either push or pull the data. The solution is to instead use an API, which is offered through gcp
.
We are going to follow the instructions from this page
Let's start by creating a data storage bucket. On the GCP start page, in the sidebar, click on Cloud Storage
. On the next page click the Create bucket
:
Give the bucket a unique name, set it to a region close by and, importantly, remember to enable Object versioning under the last tab. Finally click Create
.
After creating the bucket, you should be able to see it online and you should also be able to see it if you type the following in your local terminal:
gsutil ls\n
gsutil
is an additional command to gcloud
, that provides more command line options.
Next we need the Google storage extension for dvc
pip install dvc[gs]\n
Now in your MNIST repository where you have already configured dvc, we are going to change the storage from our Google drive to our newly created Google cloud storage.
dvc remote add -d remote_storage <output-from-gsutils>\n
In addition we are also going to modify the remote to support object versioning (called version_aware
in dvc
):
dvc remote modify remote_storage version_aware true\n
This will change the default way that dvc
handles data. Instead of storing the data in a content-addressable way, it will now store the data as it looks in our local repository, which means that we are no longer limited to using dvc
to download our data.
The above command will change the .dvc/config
file. git add
and git commit
the changes to that file. Finally, push data to the cloud
dvc push\n
Finally, make sure that you can pull without having to give your credentials. The easiest way to see this is to delete the .dvc/cache
folder that should be locally on your laptop and afterwards do a dvc pull
.
This setup should work when trying to access the data from your laptop, which we authenticated in the previous module. However, how can you access the data from a virtual machine, inside a docker container or from a different laptop? We in general recommend two ways:
You can make the bucket publicly accessible, i.e. no authentication needed. That means that anyone with the url to the data can access it. This is the easiest way to do it, but also the least secure. You can read more about how to make your buckets public here.
You can create a service account which is a more secure way of accessing data. A service account is essentially a second user which you can give access to specific services. You can read more about how to create a service account here. Once you have created a service account you can give it access to a specific bucket by going to the Permissions
tab of the bucket and add the service account as a member.
If you need to authenticate your service account from a VM, you can do it by running the following command:
gcloud auth activate-service-account --key-file=<key-file>\n
where the <key-file>
is the json file that you downloaded when you created the service account (DO NOT SHARE THIS).
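If you want to access the bucket programmatically from Python, the same service-account key can also be used through the google-cloud-storage package (pip install google-cloud-storage). A minimal sketch, where the bucket name, blob name and key-file path are placeholders you need to replace:
from google.cloud import storage\n\n# authenticate with the downloaded service-account key (keep this file private)\nclient = storage.Client.from_service_account_json('service-account-key.json')\nbucket = client.bucket('<my-bucket-name>')\n# download a single file from the bucket to the local machine\nbucket.blob('data.pt').download_to_filename('data.pt')\n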
You should hopefully at this point have seen the strength of using containers to create reproducible environments. They allow us to specify exactly the software that we want to run inside our VMs. However, you should already have run into two problems with containers
For this reason we want to move both the building process and the storage of images to the cloud. In GCP the service for this is called Artifact registry, formerly known as Container registry.
"},{"location":"s6_the_cloud/using_the_cloud/#exercises_2","title":"\u2754 Exercises","text":"For the purpose of these exercise I recommend that you start out with a dummy version of some code to make sure that the building process do not take too long. You are more than free to fork this repository. The repository contains a simple python script that does image classification using sklearn. The docker images for this application are therefore going to be substantially faster to build and smaller in size than the images we are used to that uses Pytorch.
Start by enabling the service: Google Artifact Registry API
and Google Cloud Build API
. This can be done through the website (by searching for the services) or can also be enabled from the terminal:
gcloud services enable artifactregistry.googleapis.com\ngcloud services enable cloudbuild.googleapis.com\n
Google cloud build can in principle work out of the box with dockerfiles. However, the recommended way is to add specialized cloudbuild.yaml
files. They should look something like this:
steps:\n - name: 'gcr.io/cloud-builders/docker'\n args: ['build', '-t', 'gcr.io/<project-id>/<image-name>', '.']\n - name: 'gcr.io/cloud-builders/docker'\n args: ['push', 'gcr.io/<project-id>/<image-name>']\n
which essentially is a basic yaml file that contains a list of steps, where each step consists of the service that should be used and the arguments for that service. In the above example we are calling the same service (cloud-builders/docker
) with different arguments (build
and then push
). Implement such a file in your repository. Hint: if you forked the repository then you at least need to change the <project-id>
.
From the gcp
homepage, navigate to the triggers panel:
Click on the manage repositories.
From there, click the Connect Repository
and go through the steps of authenticating your github profile with gcp
and choose the repository that you want to set up build triggers for. For now, skip the Create a trigger (optional)
part by pressing Done
in the end.
Navigate back to the Triggers
homepage and click Create trigger
. Set the following:
Push to branch
^main$
Autodetected
or Cloud build configuration file
Finally click the Create
button and the trigger should show up on the triggers page.
To activate the trigger, push some code to the chosen repository.
Go to the Cloud Build
page and you should see the image being built and pushed.
Try clicking on the build to check out the build process and build summary. As you can see from the image, if a build is failing you will often find valuable info by looking at the build summary. If your build is failing, try to configure it to run in one of these regions: us-central1, us-west2, europe-west1, asia-east1, australia-southeast1, southamerica-east1
as specified in the documentation.
If/when your build is successful, navigate to the Artifact Registry
page. You should hopefully find that the image you just built was pushed here. Congrats!
Finally, to pull your image down to your laptop
docker pull gcr.io/<project-id>/<image_name>:<image_tag>\n
you will need to authenticate docker
with gcp
first. Instructions can be found here, but the following command should hopefully be enough to make docker
and gcp
talk to each other:
gcloud auth configure-docker\n
Note: To do this you need to have docker
actively running in the background, as at any other time you want to use docker
.
Automation through the cloud is in general the way to go, but sometimes you may want to manually create images and push them to the registry. Figure out how to push an image to your Container Registry
. For simplicity you can just push the busybox
image you downloaded during the initial docker exercises. This page should help you with the exercise.
As the final step in our journey through different GCP services in this module, we are going to look at training our models. This is one of the important tasks that GCP can help us with, because we can always rent more hardware as long as we have credits, meaning that we can scale both horizontally (run more experiments) and vertically (run longer experiments).
We are going to check out two ways of running our experiments. First we are going to return to the Compute Engine service because it gives the simplest form of scaling of experiments. That is: we create a VM with an appropriate docker image, we start it, log in to the VM and run our experiments. It is possible for most people to run a couple of experiments in parallel this way. However, what if there was an abstraction layer that automatically created the VM for us, launched our experiments and then closed down the VM afterwards?
This is where the Vertex AI service comes into play. This is a dedicated service for handling ML models in the cloud on GCP. Vertex AI is in principle an end-to-end service that can take care of everything machine learning related in the cloud. In this course we are primarily focused on just the training of our models, and we then use other services for the different parts of our pipeline.
"},{"location":"s6_the_cloud/using_the_cloud/#exercises_3","title":"\u2754 Exercises","text":"Lets start by see how we could train a model using Pytorch using the Compute Engine service:
Start by creating an appropriate VM. If you want to start a VM that has PyTorch pre-installed with only CPU support, you can run the following command
gcloud compute instances create <instance-name> \\\n --zone europe-west1-b \\\n --image-family=pytorch-latest-cpu \\\n --image-project=deeplearning-platform-release\n
alternatively, if you have access to GPU in your GCP account you could start a VM in the following way
gcloud compute instances create <instance-name> \\\n --zone europe-west4-a \\\n --image-family=pytorch-latest-gpu \\\n --image-project=deeplearning-platform-release \\\n --accelerator=\"type=nvidia-tesla-v100,count=1\" \\\n --metadata=\"install-nvidia-driver=True\" \\\n --maintenance-policy TERMINATE\n
Next, log in to your newly created VM. You can either open an ssh
terminal in the cloud console or run the following command
gcloud beta compute ssh <instance-name>\n
It is recommended to always check that the VM we get is actually what we asked for. In this case the VM should have PyTorch pre-installed, so let's check for that by running
python -c \"import torch; print(torch.__version__)\"\n
Additionally, if you have a VM with GPU support also try running the nvidia-smi
command.
When you have logged in to the VM, it works like your own machine. Therefore, to run some training code you would need to do the same setup steps you have done on your own machine: clone your github repository, install dependencies, download data and run the code. Try doing this to make sure you can train a model.
(Optional, may not work as intended) The last step in the previous exercise involves a lot of setup that would be necessary to do every time we create a new VM, making horizontal scaling of experiments cumbersome. However, we have already developed docker images that can take care of most of the setup.
Let's for simplicity just create a very small docker image (called gcp_vm_tester.dockerfile
) that you can use
FROM gcr.io/deeplearning-platform-release/pytorch-cpu\nRUN pip install matplotlib\n
this basically just extends the base Pytorch image to also install matplotlib. The important part about the docker images that we want to use here is that they should not have an ENTRYPOINT
at the end, because we do not want the docker container to actually run our scripts, just install dependencies on startup.
Let's build the docker image and manually push it to our container registry in gcp. Build with:
docker build -f gcp_vm_tester.dockerfile . -t gcp_vm_tester:latest\n
and then push with
docker tag gcp_vm_tester gcr.io/<project-id>/gcp_vm_tester\ndocker push gcr.io/<project-id>/gcp_vm_tester\n
confirm by going to the container registry in the cloud console and checking that the image has been correctly pushed.
Let's then create a VM with that particular docker image. Instead of using gcloud compute instances create
we are now using the gcloud compute instances create-with-container
command
gcloud compute instances create-with-container <instance-name> \\\n    --container-image=gcr.io/<project-id>/gcp_vm_tester \\\n    --zone europe-west1-b\n
Confirm that everything works by accessing your newly created VM and running both of these commands
python -c \"import torch; print(torch.__version__)\"\npython -c \"import matplotlib; print(matplotlib.__version__)\"\n
We are now moving on to the final way to train our code, using the Vertex AI
service.
Start by enabling it by searching for Vertex AI
in the cloud console and going to the service
The way we are going to use Vertex AI is to create custom jobs, because we have already developed docker containers that contain everything needed to run our code. Thus the only command that we actually need to use is the gcloud ai custom-jobs create
command. An example here would be:
gcloud ai custom-jobs create \\\n --region=europe-west1 \\\n --display-name=test-run \\\n --config=config.yaml\n
Essentially, this command combines everything into one: it first creates a VM with the specs specified in a configuration file, then loads the container specified in the same configuration file and finally runs everything. An example of a config file could be:
# config_cpu.yaml\nworkerPoolSpecs:\n machineSpec:\n machineType: n1-highmem-2\n replicaCount: 1\n containerSpec:\n imageUri: gcr.io/<project-id>/<docker-img>\n
if you only want to run on CPU and another example for GPU:
# config_gpu.yaml\nworkerPoolSpecs:\n machineSpec:\n machineType: n1-standard-8\n acceleratorType: NVIDIA_TESLA_T4 #(1)!\n acceleratorCount: 1\n replicaCount: 1\n containerSpec:\n imageUri: gcr.io/<project-id>/<docker-img>\n
you can read more about the configuration formatting here and the different types of machines here. Try to execute a job using the gcloud ai custom-jobs create
command. For additional documentation you can check out the documentation on the command and this page and this page
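If you prefer to stay in Python, the same kind of custom job can also be submitted through the google-cloud-aiplatform package. The sketch below mirrors the CPU config above; the project id, region, staging bucket and image are placeholders, and the exact arguments may differ slightly between SDK versions:
from google.cloud import aiplatform\n\n# placeholders: replace project, region, bucket and image with your own values\naiplatform.init(project='<project-id>', location='europe-west1', staging_bucket='gs://<my-bucket-name>')\n\njob = aiplatform.CustomJob(\n    display_name='test-run',\n    worker_pool_specs=[{\n        'machine_spec': {'machine_type': 'n1-highmem-2'},\n        'replica_count': 1,\n        'container_spec': {'image_uri': 'gcr.io/<project-id>/<docker-img>'},\n    }],\n)\njob.run()  # blocks until the job finishes\n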
Assuming you manage to launch a job, you should see an output like this:
Try executing the commands that are outputted to look at both the status and the progress of your job.
In addition you can also visit the Custom Jobs
tab in the training
part of Vertex AI
Check it out.
During custom training we do not necessarily need to use dvc
for downloading our data. A more efficient way is to use cloud storage as a mounted file system. This allows us to access data directly from the cloud storage without having to download it first. All our training jobs automatically get a gcs
folder mounted in the root directory. Try to access the data from your training script:
# loading from a bucket using mounted file system\ndata = torch.load('/gcs/<my-bucket-name>/data.pt')\n# writing to a bucket using mounted file system\ntorch.save(data, '/gcs/<my-bucket-name>/data.pt')\n
It should speed up the training process a bit.
This ends the session on how to use Google cloud services for now. In a future session we are going to investigate a bit more of the services offered in GCP, in particular for deploying the models that we have just trained.
"},{"location":"s7_deployment/","title":"08. Model deployment","text":"Slides
Let's say that you have spent 1000 GPU hours and trained the most awesome model that you want to share with the world. One way to do this is of course to just place all your code in a github repository, upload a file with the trained model weights to your favorite online storage (assuming it is too big for github to handle) and ask people to download your code and the weights to run the code by themselves. This is a fine approach in a small research setting, but in production you need to be able to deploy the model to an environment that is fully contained, such that people can just execute it without looking (too hard) at the code.
Image credit
In this session we try to look at methods specialized towards deployment of models on your local machine and also how to deploy services in the cloud.
Learning objectives
The learning objectives of this session are:
fastapi
and run it locallyCore Module
Before we can get deployment of our models we need to understand concepts such as APIs and requests. The core reason for this is that we need a new abstraction layer on top of our applications that are not Python-specific. While Python is the defacto language for machine learning, we cannot expect everybody else to use it and in particular, we cannot expect network protocols (both locally and external) to be able to communicate with our Python programs out of the box. For this reason, we need to understand requests, in particular HTTP requests and how to create APIs that can interact with those requests.
"},{"location":"s7_deployment/apis/#requests","title":"Requests","text":"When we are talking about requests, we are essentially talking about the communication method used in client-server types of architectures. As shown in the image below, in this architecture, the client (user) is going to send requests to a server (our machine learning application) and the server will give a response. For example, the user may send a request to get the class of a specific image, which our application will do and then send back the response in terms of a label.
Image creditThe common way of sending requests is called HTTP (Hypertext Transfer Protocol). It is essentially a specification of the intermediary transportation method between the client and server. An HTTP request essentially consists of two parts:
The common request methods are (case sensitive):
You can read more about the different methods here. For most machine learning applications, GET and POST are the core methods to remember. Additionally, if you want to read more about HTTP in general we highly recommend that you go over this comic strip protocol, but the TLDR is that it provides privacy, integrity and identification over the web.
"},{"location":"s7_deployment/apis/#exercises","title":"\u2754 Exercises","text":"We are going to do a couple of exercises on sending requests using requests package to get familiar with the syntax.
Start by installing the `requests`` package
pip install requests\n
Afterwards, create a small script and try to execute the code
import requests\nresponse = requests.get('https://api.github.com/this-api-should-not-exist')\nprint(response.status_code)\n
As you can see from the syntax, we are sending a request using the GET method. This code should return status code 404. Take a look at this page that contains a list of status codes. Next, let's call a page that exists
import requests\nresponse = requests.get('https://api.github.com')\nprint(response.status_code)\n
What is the status code now and what does it mean? Status codes are important when you have an application that is interacting with a server and wants to make sure that it does not fail, which can be done with simple if
statements on the status codes
if response.status_code == 200:\n print('Success!')\nelif response.status_code == 404:\n print('Not Found.')\n
Next, try to call the following
response=requests.get(\"https://api.github.com/repos/SkafteNicki/dtu_mlops\")\n
which gives back a payload. Essentially, payload refers to any additional data that is sent from the client to the server or vice-versa. Try looking at the response.content
attribute. What is the type of this attribute?
You should hopefully observe that the .content
attribute is of type bytes
. It is important to note that this is the standard way of sending payloads to encode them into byte
objects. To get a more human-readable version of the response, we can convert it to JSON format
response.json()\n
Important to remember that a JSON object in Python is just a nested dictionary if you ever want to iterate over the object in some way.
When we use the GET method we can additionally provide a params
argument, that specifies what we want the server to send back for a specific request URL:
response = requests.get(\n 'https://api.github.com/search/repositories',\n params={'q': 'requests+language:python'},\n)\n
Before looking at reponse.json()
can you explain what the code does? You can try looking at this page for help.
Sometimes the content of a page cannot be converted into JSON, because as already stated data is sent as bytes. Say that we want to download an image, which we can do in the following way
import requests\nresponse = requests.get('https://imgs.xkcd.com/comics/making_progress.png')\n
Try calling response.json()
, what happens? Next, try calling response.content
. To get the result in this case we would need to convert from bytes to an image:
with open(r'img.png','wb') as f:\n f.write(response.content)\n
The get
method is the most useful method because it allows us to get data from the server. However, as stated in the beginning multiple request methods exist, for example, the POST method for sending data to the server. Try executing:
pload = {'username':'Olivia','password':'123'}\nresponse = requests.post('https://httpbin.org/post', data = pload)\n
Investigate the response (this is an artificial example because we do not control the server).
Finally, we should also know that requests can be sent directly from the command line using the curl
command. Sometimes it is easier to send a request directly from the terminal and sometimes it is easier to do it from a script.
Make sure you have curl
installed, or else find instruction on installing it. To check call curl -
-help` with the documentation on curl.
To execute requests.get('https://api.github.com')
using curl we would simply do
curl -X GET \"https://api.github.com\"\ncurl -X GET -I \"https://api.github.com\" # if you want the status code\n
Try it yourself.
Try to redo some of the exercises yourself using curl
.
That ends the intro session on requests
. Do not worry if you are still not completely comfortable with sending requests, we are going to return to how we do it in practice when we have created our API. If you want to learn more about the requests
package you can check out this tutorial and if you want to see more examples of how to use curl
you can check out this page
Requests are all about being on the client side of our client-server architecture. We are now going to move on to the server side where we will be learning about writing the APIs that requests can interact with. An application programming interface (API) is essentially the way for the developer (you) tells a user how to use the application that you have created. The API is an abstraction layer that allows the user to interact with our application in the way we want them to interact with it, without the user even having to look at the code.
We can take the API from github as an example https://api.github.com. This API allows any user to retrieve, integrate and send data to Github without ever having to visit their webpage. The API exposes multiple endpoints that have various functions:
and we could go on. However, there may be functionality that Github is not interested in users having access to and they may therefore choose not to have endpoints for specific features (1).
The particular kind of API we are going to work with is called REST API (or RESTful API). The REST API specify specific constraints that a particular API needs to fulfill to be considered RESTful. You can read more about what the six guiding principles behind REST API on this page but one of the most important to have in mind is that the client-server architecture needs to be stateless. This means that whenever a request is send to the server it needs to be self-contained (all information included) and the server cannot rely on any previously stored information from previous requests.
To implement APIs in practise we are going to use FastAPI. FastAPI is a modern, fast (high-performance), web framework for building APIs with Python 3.6+ based on standard Python type hints. FastAPI is only one of many frameworks for defining APIs, however, compared to other frameworks such as Flask and django it offers a sweet spot of being flexible enough to do what you want without having many additional (unnecessary) features.
"},{"location":"s7_deployment/apis/#exercises_1","title":"\u2754 Exercises","text":"The exercises below are a condensed version of this and this tutorial. If you ever need context for the exercises, we can recommend trying to go through these. Additionally, we also provide this solution file that you can look through for help.
Install FastAPI
pip install fastapi\n
This contains the functions, modules, and variables we are going to need to define our interface.
Additionally, also install uvicorn
which is a package for defining low level server applications.
pip install uvicorn[standard]\n
Start by defining a small application like this in a file called main.py
:
from fastapi import FastAPI\napp = FastAPI()\n\n@app.get(\"/\")\ndef read_root():\n return {\"Hello\": \"World\"}\n\n@app.get(\"/items/{item_id}\")\ndef read_item(item_id: int):\n return {\"item_id\": item_id}\n
Importantly here is the use of the @app.get
decorator. What could this decorator refer to? Explain what the two functions are probably doing.
Next lets launch our app. Since we called our script main.py
and we inside the script initialized our API with app = FastAPI
, our application that we want to deploy can be referenced by main:app
:
uvicorn --reload --port 8000 main:app\n
this will launch a server at this page: http://localhost:8000/
. As you will hopefully see, this page will return the content of the root
function, like the image below. Remember to also check the output in your terminal as that will give info on when and how your application is being invoked.
What webpage should you open to get the server to return 1
?
Also checkout the pages: http://localhost:8000/docs
and http://localhost:8000/redoc
. What does these pages show?
The power of the docs
and redoc
pages is that they allow you to easily test your application with their simple UI. As shown in the image below, simply open the endpoint you want to test, click the Try it out
button, input any values and execute it. It will return both the corresponding curl
command for invoking your endpoint, the corresponding URL and response of you application. Try it out.
You can also checkout http://localhost:8000/openapi.json
to check out the schema that is generated which essentially is a json
file containing the overall specifications of your program.
Try to access http://localhost:8000/items/foo
, what happens in this case? When you specify types in your API, FastAPI will automatically do type validation using pydantic, making sure users can only access your API with the correct types. Therefore, remember to include types in your applications!
With the fundamentals in place let's configure it a bit more:
Lets start by changing the root function to include a bit more info. In particular we are also interested in returning the status code so the end user can easily read that. Default status codes are included in the http built-in python package:
from http import HTTPStatus\n\n@app.get(\"/\")\ndef root():\n \"\"\" Health check.\"\"\"\n response = {\n \"message\": HTTPStatus.OK.phrase,\n \"status-code\": HTTPStatus.OK,\n }\n return response\n
try to reload the app and see what is returned now. You should not have to re-launch the app because we initialized the app with the --reload
argument.
When we decorate our functions with @app.get(\"/items/{item_id}\")
, item_id
is in the case what we call a path parameters because it is a parameter that is directly included in the path of our endpoint. We have already seen how we can restrict a path to a single type, but what if we want to restrict it to specific values? This is often the case if we are working with parameters of type str
. In this case we would need to define a enum
:
from enum import Enum\nclass ItemEnum(Enum):\n alexnet = \"alexnet\"\n resnet = \"resnet\"\n lenet = \"lenet\"\n\n@app.get(\"/restric_items/{item_id}\")\ndef read_item(item_id: ItemEnum):\n return {\"item_id\": item_id}\n
Add this API, reload and execute both a valid parameter and a non-valid parameter.
In contrast to path parameters we have query parameters. In the requests exercises we saw an example of this where we were calling https://api.github.com/search/code with the query 'q': 'requests+language:python'
. Any parameter in FastAPI that is not a path parameter, will be considered a query parameter:
@app.get(\"/query_items\")\ndef read_item(item_id: int):\n return {\"item_id\": item_id}\n
Add this API, reload and figure out how to pass in a query parameter.
We have until now worked with the .get
method, but lets also see an example of the .post
method. As already described the POST request method is used for uploading data to the server. Here is a simple app that saves username and password in a database (please never implement this in real life like this):
database = {'username': [ ], 'password': [ ]}\n\n@app.post(\"/login/\")\ndef login(username: str, password: str):\n username_db = database['username']\n password_db = database['password']\n if username not in username_db and password not in password_db:\n with open('database.csv', \"a\") as file:\n file.write(f\"{username}, {password} \\n\")\n username_db.append(username)\n password_db.append(password)\n return \"login saved\"\n
Make sure you understand what the function does and then try to execute it a couple of times to see your database updating. It is important to note that we sometimes in the following exercises use the .get
method and sometimes the .post
method. For our usage it does not really matter.
We are now moving on to figuring out how to provide different standard inputs like text, images, json to our APIs. It is important that you try out each example yourself and in particular you look at the curl
commands that are necessary to invoke each application.
Here is a small application, that takes a single text input
@app.get(\"/text_model/\")\ndef contains_email(data: str):\n regex = r'\\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\\.[A-Z|a-z]{2,}\\b'\n response = {\n \"input\": data,\n \"message\": HTTPStatus.OK.phrase,\n \"status-code\": HTTPStatus.OK,\n \"is_email\": re.fullmatch(regex, data) is not None\n }\n return response\n
What does the application do? Try it out yourself
Let's say we wanted to extend the application to check for a specific email domain, either gmail
or hotmail
. Assume that we want to feed this into our application as a json
object e.g.
{\n \"email\": \"mlops@gmail.com\",\n \"domain_match\": \"gmail\"\n}\n
Figure out how to alter the data
parameter such that it takes in the json
object and make sure to extend the application to check if the email and domain also match. Hint: take a look at this page
Let's move on to an application that requires a file input:
from fastapi import UploadFile, File\nfrom typing import Optional\n\n@app.post(\"/cv_model/\")\nasync def cv_model(data: UploadFile = File(...)):\n with open('image.jpg', 'wb') as image:\n content = await data.read()\n image.write(content)\n image.close()\n\n response = {\n \"input\": data,\n \"message\": HTTPStatus.OK.phrase,\n \"status-code\": HTTPStatus.OK,\n }\n return response\n
A couple of new things are going on here: we use the specialized UploadFile
and File
bodies in our input definition. Additionally, we added the async
/await
keywords. Figure out what everything does and try to run the application (you can use any image file you like).
The above application actually does not do anything. Let's add opencv as a package and let's resize the image. It can be done with the following three lines:
import cv2\nimg = cv2.imread(\"image.jpg\")\nres = cv2.resize(img, (h, w))\n
Figure out where to add them in the application and additionally add h
and w
as optional parameters, with a default value of 28. Try running the application where you specify everything and one more time where you leave out h
and w
.
Finally, let's also figure out how to return a file from our application. You will need to add the following lines:
from fastapi.responses import FileResponse\ncv2.imwrite('image_resize.jpg', res)\nFileResponse('image_resize.jpg')\n
Figure out where to add them to the code and try running the application one more time to see that you get a file back with the resized image.
(Optional) Let's try to figure out how to use FastAPI in a machine learning context. Below is a script that downloads a VisionEncoderDecoder
from huggingface . The model can be used to create captions for a given image. Thus calling
predict_step(['s7_deployment/exercise_files/my_cat.jpg'])\n
returns a list of strings like ['a cat laying on a couch with a stuffed animal']
(try this yourself). Create a FastAPI application that can do inference using this model e.g. it should take in an image, preferably an optional json
object for configuring some of the hyperparameters (like max_length
) and should return a string containing the generated caption.
from transformers import VisionEncoderDecoderModel, ViTFeatureExtractor, AutoTokenizer\nimport torch\nfrom PIL import Image\n\nmodel = VisionEncoderDecoderModel.from_pretrained(\"nlpconnect/vit-gpt2-image-captioning\")\nfeature_extractor = ViTFeatureExtractor.from_pretrained(\"nlpconnect/vit-gpt2-image-captioning\")\ntokenizer = AutoTokenizer.from_pretrained(\"nlpconnect/vit-gpt2-image-captioning\")\n\ndevice = torch.device(\"cuda\" if torch.cuda.is_available() else \"cpu\")\nmodel.to(device)\n\ngen_kwargs = {\"max_length\": 16, \"num_beams\": 8, \"num_return_sequences\": 1}\ndef predict_step(image_paths):\n images = []\n for image_path in image_paths:\n i_image = Image.open(image_path)\n if i_image.mode != \"RGB\":\n i_image = i_image.convert(mode=\"RGB\")\n\n images.append(i_image)\n pixel_values = feature_extractor(images=images, return_tensors=\"pt\").pixel_values\n pixel_values = pixel_values.to(device)\n output_ids = model.generate(pixel_values, **gen_kwargs)\n preds = tokenizer.batch_decode(output_ids, skip_special_tokens=True)\n preds = [pred.strip() for pred in preds]\n return preds\n
As the final step, we want to figure out how to include our FastAPI application in a docker container as it will help us when we want to deploy in the cloud because docker as always can take care of the dependencies for our application. For the following set of exercises you can take whatever previous FastAPI application as the base application for the container
Start by creating a requirement.txt
file for your application. You will at least need fastapi
and uvicorn
in the file and we always recommend that you are specific about the version you want to use
fastapi>=0.68.0,<0.69.0\nuvicorn>=0.15.0,<0.16.0\n# add anything else you application needs to be able to run\n
Next, create a Dockerfile
with the following content
FROM python:3.9\nWORKDIR /code\nCOPY ./requirements.txt /code/requirements.txt\n\nRUN pip install --no-cache-dir --upgrade -r /code/requirements.txt\nCOPY ./app /code/app\n\nCMD [\"uvicorn\", \"app.main:app\", \"--host\", \"0.0.0.0\", \"--port\", \"80\"]\n
The above assumes that your file structure looks like this
.\n\u251c\u2500\u2500 app\n\u2502 \u251c\u2500\u2500 __init__.py\n\u2502 \u2514\u2500\u2500 main.py\n\u251c\u2500\u2500 Dockerfile\n\u2514\u2500\u2500 requirements.txt\n
Hopefully, all these steps should look familiar if you already went through module M9, except for maybe the last line. However, this is just the standard way that we have run our FastAPI applications in the last couple of exercises, this time with some extra arguments regarding the ports we allow.
Next, build the corresponding docker image
docker build -t my_fastapi_app .\n
Finally, run the image such that a container is spun up that runs our application. The important part here is to remember to specify the -p
argument (p for port) which should be the same number as the port we have specified in the last line of our Dockerfile.
docker run --name mycontainer -p 80:80 myimage\n
Check that everything is working by going to the corresponding localhost page http://localhost/items/5?q=somequery
(Optional) In module M15 on unittesting you learned how to write unit tests for your data pipeline and model. It should come as no surprise that the same can also be done for your API. Doing so should be able to tell you if your API is working as you expect it to do. The only complication regarding APIs is that you need a server to do testing, and we cannot use uvicorn
for this. Check out this page on how to test FastAPI
application, and add a file called test_api.py
to your tests
folder with appropriate tests for your API.
This ends the module on APIs. If you want to go further in this direction we highly recommend that you check out bentoml which is an API standard that focuses solely on creating easy-to-understand APIs and services for ml-applications. Additionally, we can also highly recommend checking out Postman which can help design, document and in particular test the API you are writing to make sure that it works as expected.
"},{"location":"s7_deployment/cloud_deployment/","title":"M24 - Cloud Deployment","text":""},{"location":"s7_deployment/cloud_deployment/#cloud-deployment","title":"Cloud deployment","text":"Core Module
We are now returning to using the cloud. In this module you should have gone through the steps of having your code in your github repository to automatically build into a docker container, store that, store data and pull it all together to make a training run. After the training is completed you should hopefully have a file stored in the cloud with your trained model weights.
Todays exercises will be about serving those model weights to an end user. We focus on two different ways of deploying our model, Google cloud functions
and Google Vertex AI endpoints
.
Cloud functions are the easiest way to get started with deployment because they are what is called serverless. For serverless deployment we still need a server to do the actual workload, however the core concept is that you do you have to manage the server. Everything is magically taken care of behind the scene.
"},{"location":"s7_deployment/cloud_deployment/#exercises","title":"\u2754 Exercises","text":"Go to the start page of Cloud Functions
. Can be found in the sidebar on the homepage or you can just search for it. Activate the service if not already active.
Click the Create Function
button which should take you to a screen like the image below. Give it a name, set the server region to somewhere close by and change the authentication policy to Allow unauthenticated invocations
so we can access it directly from a browser. Remember to note down the URL of the service somewhere.
On the next page, for Runtime
pick the Python 3.9
option. This will make the inline editor show both a main.py
and requirements.py
file. Look over them. Click the Deploy
button in the lower left corner.
Afterwards you should see a green check mark beside your function meaning that it is deployed. Click the Test function
button which will take you to the testing page.
If you know what the application does, it should come as no surprise that it can run without any input. We therefore just send an empty request by clicking the Test The Function
button. Does the function return the output you expected? Wait for the logs to show up. What do they show?
What should the Triggering event
look like in the testing prompt for the program to respond with
Good day to you sir!\n
Try it out.
Click on the metrics tab. Identify what each panel is showing.
Go to the trigger tab and go to the url for the application.
Checkout the logs tab. You should see that your application have already been invoked multiple times. Also try to execute this command in a terminal:
gcloud functions logs read\n
Next, we are going to create an application that actually takes some input so we can try to send it requests. We provide a very simple sklearn_cloud_function.py script to get started.
Figure out what the script does and run the script. This should create a file with trained model.
Next create a storage bucket and upload the model file to the bucket. You can either do this through the webpage or run the following commands:
gsutil mb gs://<bucket-name> # mb stands for make bucket\ngsutil cp <file-name> gs://<bucket-name> # cp stands for copy\n
check that the file is in the bucket.
Create a new cloud function with the same initial settings as the first one. Choose also the Python 3.9
but this time change code to something that can actually use the model we just uploaded. Here is a code snippet to help you:
from google.cloud import storage\nimport pickle\n\nBUCKET_NAME = ...\nMODEL_FILE = ...\n\nclient = storage.Client()\nbucket = client.get_bucket(BUCKET_NAME)\nblob = bucket.get_blob(MODEL_FILE)\nmy_model = pickle.loads(blob.download_as_string())\n\ndef knn_classifier(request):\n \"\"\" will to stuff to your request \"\"\"\n request_json = request.get_json()\n if request_json and 'input_data' in request_json:\n data = request_json['input_data']\n input_data = list(map(int, data.split(',')))\n prediction = my_model.predict([input_data])\n return f'Belongs to class: {prediction}'\n else:\n return 'No input data received'\n
Some notes: * For locally testing the above code you will need to install the google-cloud-storage
python package * Remember to change the Entry point
* Remember to also fill out the requirements.txt
file. You need at least two packages to run the application with google-cloud-storage
being one of them. * If you deployment fails, try to go to the Logs Explorer
page in gcp
which can help you identify why.
When you have successfully deployed the model, try to make predictions with it.
You can finally try to redo the exercises deploying a Pytorch application. You will essentially need to go through the same steps as the sklearn example, including uploading a trained model to a storage, write a cloud function that loads it and return some output. You are free to choose whatever Pytorch model you want.
Cloud functions are great for simple deployments, that can be encapsulated in a single script with only simple requirements. However, they do not really scale with more advance applications that may depend on multiple programming languages. We are already familiar with how we can deal with this through containers and Cloud Run is the corresponding service in GCP for deploying containers.
"},{"location":"s7_deployment/cloud_deployment/#exercises_1","title":"\u2754 Exercises","text":"We are going to start locally by developing a small app that we can deploy. We provide two small examples to choose from: first a small FastAPI app consisting of this .py file and this dockerfile . Secondly a small streamlit application consisting of just this dockerfile . You are free to choose which application to work with.
Start by going over the files belonging to your choice app and understand what it does.
Next build the docker image belonging to the app
docker build -f <dockerfile> . -t gcp_test_app:latest\n
Next tag and push the image to your container registry
docker tag gcp_test_app gcr.io/<project-id>/gcp_test_app\ndocker push gcr.io/<project-id>/gcp_test_app\n
afterwards check you container registry to check that you have successfully pushed the image.
Next go to Cloud Run
in the cloud console an enable the service
Click the Create Service
button which should bring you to a page similar to the one below
Do the following: * Click the select button, which will bring up all build containers and pick the one you want to deploy. In the future you probably want to choose the Continuously deploy new revision from a source repository such that a new version is always deployed when a new container is build. * Hereafter, give the service a name and select the region. We recommend do choose a region close to you, however it does not really matter that much for our use case * Set the authentication method to Allow unauthenticated invocations such that we can call it without providing credentials. In the future you may only set that authenticated invocations are allowed. * Expand the Container, Connections, Security tab and edit the port such that it matches the port exposed in your chosen application.
Finally, click the create button and wait for the service to be deployed (may take some time).
If you manage to deploy the service you should see an image like this:
You can now access your application by clicking the URL. This will access the root of your application, so you may need to add /
or /<path>
to the url depending on how the app works.
Everything we just did to deploy a container can be reproduced using the following command:
gcloud run deploy $APP --image $TAG --platform managed --region $REGION --allow-unauthenticated\n
and checked using these two commands
gcloud run services list\ngcloud run services describe $APP --region $REGION\n
Feel free to experiment with doing the deployment from the command line.
Instead of deploying our docker container using the UI or command line, which is a manual operation, we can do it in a continuous manner by using the cloudbuild.yaml
file we learned about in the previous section. We just need to add a new step to the file. We provide an example:
steps:\n# Build the container image\n- name: 'gcr.io/cloud-builders/docker'\n  args: ['build', '-t', 'gcr.io/$PROJECT_ID/<container-name>:latest', '.'] #(1)!\n# Push the container image to Container Registry\n- name: 'gcr.io/cloud-builders/docker'\n  args: ['push', 'gcr.io/$PROJECT_ID/<container-name>:latest']\n# Deploy container image to Cloud Run\n- name: 'gcr.io/google.com/cloudsdktool/cloud-sdk'\n  entrypoint: gcloud\n  args:\n  - 'run'\n  - 'deploy'\n  - '<service-name>'\n  - '--image'\n  - 'gcr.io/$PROJECT_ID/<container-name>:latest'\n  - '--region'\n  - '<region>'\n
This line assumes you are standing in the root of your repository and tries to build the docker image specified in a file called Dockerfile
and tag it with the name gcr.io/$PROJECT_ID/my_deployment:latest
. Therefore, if you want to point to another dockerfile, you need to add the -f
option to the command. For example, if you want to point to my_app/my_serving_app.dockerfile
you need to change the line to
args: ['build', '-f', 'my_app/my_serving_app.dockerfile', '-t', 'gcr.io/$PROJECT_ID/my_deployment:latest', '.']\n
where you need to replace <container-name>
with the name of your container, <service-name>
with the name of the service you want to deploy and <region>
with the region you want to deploy to. Afterwards you need to set up a trigger (or reuse the one you already have) to build the container and deploy it to Cloud Run. Confirm that this works by making a change to your application, pushing it to GitHub and checking if the application is updated continuously. You can look here for help. If you succeeded, congratulations, you have now set up a continuous deployment pipeline.
That ends the exercises on deployment. The exercises above are just a small taste of what deployment has to offer. In both sections we have explicitly chosen to work with serverless deployments. But what if you wanted to do the opposite, e.g. be the one in charge of managing the cluster that handles the deployed services? If you are really interested in taking deployment to the next level, you should get started with Kubernetes, which is the de-facto open-source container orchestration platform used in production environments. If you want to deep dive, we recommend starting here, which describes how to make pipelines that are a necessary component before you start to create your own Kubernetes cluster.
"},{"location":"s7_deployment/local_deployment/","title":"M23 - Local Deployment","text":""},{"location":"s7_deployment/local_deployment/#local-deployment","title":"Local Deployment","text":"Regardless of your application, model and usecase, the first starting point of serving your model should always be to deploy it locally. The simple reason for that is debugging: if you deploy directly to the cloud you often get less verbose error message and/or the iteration time is much slower because it simply takes much longer time to deploy to the cloud than locally. Locally should therefore always be the first step with any new application.
For this module we are going to focus on deployment of deep learning models, in particular the Pytorch models used throughout the course. Pytorch has historically been developed for research purposes, where iterating quickly on ideas was valued over fast computations. This is evident since Pytorch uses a dynamic graph underneath to represent the computational graph that is created whenever you are running calculations. The graph is important, as it keeps track of how to do backpropagation through your Pytorch application. However, running code dynamically is notoriously slower than compiling your code before running it. Let's therefore first consider another way of compiling our code.
"},{"location":"s7_deployment/local_deployment/#compilation","title":"Compilation","text":"If you ever coded in any low-level language such as c, fortran or c++ you should be familiar with the term compiling. Compiling is the task of taken a computer program written in one language and translating it into another. In most cases this means taken whatever you have written in your preferred programming language and translating it into machine code that the computer can execute. But what does compilation have to do with coding Pytorch models?
It happens to be that Pytorch
comes with its own compiler that can optimize your model for you. It can be found in the submodule torch.jit
. Jit stands for just-in-time, meaning that compilation runs at the same time we are executing the code. If you know anything about low-level languages such as C/C++, you know that we normally compile the code before we run it. With jit
we essentially merge the two phases into one. jit
has two compilation modes, called respectively script and trace. In the exercises we are going to look at script, as it is the easiest to get started with and works without any code changes for nearly all kinds of models. If you ever encounter that script does not work for you, then trace can be used instead.
The major reasons why we want to compile our models with torch.jit
are:
We are here going to look at torch.jit.script
for compiling our code.
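As a small, hedged preview of what this looks like (the exercises below walk through it step by step), scripting and saving a torchvision model could look roughly like the sketch below; the weights argument is an assumption that depends on your torchvision version (older versions use pretrained=False instead):
import torch\nfrom torchvision import models\n\nmodel = models.resnet152(weights=None)  # a randomly initialized ResNet-152\nscripted_model = torch.jit.script(model)  # compile the model with TorchScript\nscripted_model.save(\"deployable_model.pt\")\n\n# a scripted model can later be loaded without access to the original Python class\nloaded_model = torch.jit.load(\"deployable_model.pt\")\n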
To see the difference in these exercises, we start out with a large model. Download one of the large image classification models from torchvision
such as ResNet-152
. For the purpose of the exercise it does not matter if you work with a randomly initialized model or a pretrained version.
Next try to script the model using torch.jit.script
. You can find the documentation here.
Just to confirm that compiling our model using torch.jit.script
did not change its output, try checking that the output of the scripted model corresponds to the output of the non-scripted model. You can do this on a single random datapoint, and you should check that the top-5 predicted classes are the same:
assert torch.allclose(unscripted_top5_indices, scripted_top5_indices)\n
Hint: use torch.topk.
Finally, try benchmarking the non-scripted model against the scripted model. I recommend using the built-in benchmarker in Pytorch: torch.utils.benchmark.Timer
, which you can read more about how to use here. Do you see an increase in performance of the scripted model compared to the non-scripted model? If so, what is the percentage increase in efficiency?
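A minimal sketch of how the benchmarking could be set up; the input shape and the variable names model and scripted_model are assumptions and should match what you created in the previous exercises:
import torch\nimport torch.utils.benchmark as benchmark\n\nx = torch.randn(1, 3, 224, 224)  # assumed input shape for an ImageNet-style model\n\nfor name, m in [(\"unscripted\", model), (\"scripted\", scripted_model)]:\n    timer = benchmark.Timer(stmt=\"m(x)\", globals={\"m\": m, \"x\": x})\n    print(name, timer.timeit(number=10))  # average runtime over 10 calls\n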
For locally deploying our model we are going to look at Torchserve. Torchserve (illustrated below) is a combined service for packaging and serving multiple Pytorch models at the same time.
Image creditBefore we go into the details of Torchserve, an important question is why we need such an abstraction on top of our developed model. Why can't we just do:
python inference.py --my_model model_checkpoint.pt --new_datapoint img.png\n
If we were never going to do anything other than call the model ourselves, then it is probably not worth adding anything else. However, if we ever want anyone else to interact with our model, we need to comply with standard ways of requesting and sending data. This is especially true when the next step is to start deploying our model in the cloud. Torchserve essentially brings an inference API on top of our model that turns our model into a client-server type of system: the client (user) sends requests to a server (our application) and the server gives a response. The request is sent as a standard HTTP request, which Torchserve helps us decode into a useful input that we can then do inference on, and the result is returned, again as a standardized HTTP response. Torchserve is in that regard similar to FastAPI or Flask, if you have ever used one of those frameworks.
Finally, the packaging part of Torchserve is necessary because we cannot give Torchserve a raw file of trained model weights, as these are essentially just a list of floats. We need a file that contains both the model definition and the trained weights, such that the model essentially becomes independent of the python interpreter.
"},{"location":"s7_deployment/local_deployment/#exercises_1","title":"\u2754 Exercises","text":"Torchserve can be a bit rough around the edges but is fairly easy to work with. We are largely going to follow the instructions listed in the readme file for Torchserve. The intention in these exercises is to serve a Resnet type neural network that is trained for classification on ImageNet. Additional documentation can be found here.
Install torchserve
and its dependencies. There are separate instructions on the homepage depending on whether you are using Windows, WSL or Linux/Mac.
Create a folder called model_store
. This is where we will store the model that we are going to deploy
Try to run the torchserve --model-store model_store
command. If the service starts with no errors, you have installed it correctly and can continue the exercise. Else it is Googling time!
Next, let's create a model we can serve. If you have done the previous exercises on compiling using scripting, we highly recommend initializing and saving such a model:
model = ResnetFromTorchVision(pretrained=True)\nscript_model = torch.jit.script(model)\nscript_model.save('deployable_model.pt')\n
Call the model archiver. We have provided a file called index_to_name.json
that maps from predicted class indices to interpretable class name e.g. 1->\"goldfish\"
. This file should be provided as the extra-files
argument such that the deployed model automatically outputs the class name. Note that this file of course only works for models trained on ImageNet.
torch-model-archiver \\\n    --model-name my_fancy_model \\\n    --version 1.0 \\\n    --serialized-file path/to/serialized_model.pt \\\n    --export-path model_store \\\n    --extra-files index_to_name.json \\\n    --handler image_classifier\n
Checkout the model_store
folder. Has the model archiver correctly created a model (with .mar
extension) inside the folder?
Finally, we are going to deploy our model and use it:
Start serving your model in one terminal:
torchserve --start --ncs --model-store model_store --models my_fancy_model=my_fancy_model.mar\n
Next, pick an image that you want to do inference on. It can be any image you want, but try to pick one that actually contains an object from the set of ImageNet classes. I have also provided an image of my own cat in the my_cat.jpg
file.
Open another terminal, which we are going to use for inference. The easiest way to do inference is using curl
directly in the terminal but you are also free to experiment with the requests
API directly in python. Using curl
should look something like this
curl http://127.0.0.1:8080/predictions/my_fancy_model -T my_image.jpg\n
Torchserve supports serving multiple models, not just one. Create a new vision model (either another resnet model or something similar), script it, save it, archive it in the same model store folder and then re-run torchserve like this
torchserve --start --ncs --model-store model_store --models all\n
Make sure that you can do inference with both models by calling curl
.
That ends the module on local deployment. Hopefully in this phase you have gained a bit of experience with sending HTTP requests, as this will be very important in the next module where we will try to deploy the models in the cloud.
"},{"location":"s8_monitoring/","title":"Monitoring","text":"Slides
We have now reached the end of our machine learning pipeline. We have successfully developed, trained and deployed a machine learning model. However, the question then becomes if you can trust that your newly deployed model still works as expected after 1 day without you intervening? What about 1 month? What about 1 year?
There may be corner cases where an ML model keeps working as expected, but the vast majority of ML models will perform worse over time because they are not generalizing well enough. For example, assume you have just deployed an application that classifies images from phones, when suddenly a new phone comes out with a new kind of sensor that takes images with a very weird aspect ratio or something else your model is not robust towards. There is nothing wrong with this, you can essentially just retrain your model on new data that accounts for this corner case, however you need a mechanism that informs you when this happens.
This is where monitoring comes into play. Monitoring practices are in charge of collecting information about your application in some format that can then be analyzed and reacted on. Monitoring is essential to ensuring the longevity of your applications.
As with many other sub-fields within MLOps we can divide monitoring into classic monitoring and ML specific monitoring. Classic monitoring (known from classic DevOps) is often about
All of these are basic pieces of information you are interested in regardless of what type of application you are trying to deploy. However, there is also ML-related monitoring that especially relates to data. Take the example above with the new phone: this we would in general consider to be a data drifting problem, i.e. the data you are trying to do inference on has drifted away from the distribution of data your model was trained on. Such monitoring problems are unique to machine learning applications and need to be handled separately.
We are in this session going to see examples of both kinds of monitoring.
Learning objectives
The learning objectives of this session are:
evidently
frameworkData drifting is one of the core reasons why model accuracy degrades over time in production. For machine learning models, data drift is the change in model input data that leads to model performance degradation. In practical terms, this means that the model is receiving input that is outside of the scope it was trained on, as seen in the figure below, which shows that the underlying distribution of a particular feature has slowly been increasing in value over two years.
Image creditIn some cases, normalizing some feature in a better way may allow your model to generalize better, but this is not always the case. The reason for such a drift is commonly some external factor that you essentially have no control over. That really only leaves you with one option: retrain your model on the newly received input features and deploy that model to production. This process is probably going to repeat over the lifetime of your application if you want to keep it up-to-date with the real world.
Image creditWe have now come up with a solution to the data drift problem, but there is one important detail that we have not taken care of: when should we actually trigger the retraining? We do not want to wait around for our model performance to degrade, thus we need tools that can detect when we are seeing a drift in our data.
"},{"location":"s8_monitoring/data_drifting/#exercises","title":"\u2754 Exercises","text":"For these exercises we are going to use the framework Evidently developed by EvidentlyAI. Evidently currently supports both detection for both regression and classification models. The exercises are in large taken from here and in general we recommend if you are in doubt about an exercise to look at the docs for API and examples (their documentation can be a bit lacking sometimes, so you may also have to dive into the source code).
Additionally, we want to stress that data drift detection, concept drift detection etc. is still an active field of research and therefore multiple frameworks exist for doing this kind of detection. In addition to Evidently, we can also mention NannyML, WhyLogs and deepcheck.
Start by installing Evidently
pip install evidently\n
you will also need scikit-learn
and pandas
installed if you do not already have it.
Hopefully you have already gone through session S7 on deployment. As part of the deployment to GCP functions you should have developed an application that can classify the iris dataset, based on a model trained by this script . We are going to convert this into a FastAPI application for the purpose here:
Convert your GCP function into a FastAPI application. The appropriate curl
command should look something like this:
curl -X 'POST' \\\n 'http://127.0.0.1:8000/iris_v1/?sepal_length=1.0&sepal_width=1.0&petal_length=1.0&petal_width=1.0' \\\n -H 'accept: application/json' \\\n -d ''\n
and the response body should look like this:
{\n \"prediction\": \"Iris-Setosa\",\n \"prediction_int\": 0\n}\n
We have implemented a solution in this file (called v1) if you need help.
Next we are going to add some functionality to our application: we want the user's input to be saved to a database whenever our application is called. However, to not slow down the response to our user we want to implement this as a background task. A background task is a function that is executed after the user has gotten their response. Implement a background task that saves the user input to a database implemented as a simple .csv
file. You can read more about background tasks here. The header of the database should look something like this:
time, sepal_length, sepal_width, petal_length, petal_width, prediction\n2022-12-28 17:24:34.045649, 1.0, 1.0, 1.0, 1.0, 1\n2022-12-28 17:24:44.026432, 2.0, 2.0, 2.0, 2.0, 1\n...\n
thus the input, a timestamp and the predicted value should all be saved; a minimal sketch of such a background task is shown below. We have implemented a solution in this file (called v2) if you need help.
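A minimal sketch of how the background task could be wired up; the endpoint path, the feature names and the model variable are assumptions and should match your own application:
from datetime import datetime\nfrom fastapi import BackgroundTasks, FastAPI\n\napp = FastAPI()\n\ndef add_to_database(now: str, sepal_length: float, sepal_width: float, petal_length: float, petal_width: float, prediction: int):\n    # append one line to the csv database; assumes the header already exists\n    with open(\"prediction_database.csv\", \"a\") as file:\n        file.write(f\"{now}, {sepal_length}, {sepal_width}, {petal_length}, {petal_width}, {prediction}\\n\")\n\n@app.post(\"/iris_v1/\")\ndef iris_inference(sepal_length: float, sepal_width: float, petal_length: float, petal_width: float, background_tasks: BackgroundTasks):\n    # model is assumed to be the sklearn model loaded elsewhere in the application\n    prediction = int(model.predict([[sepal_length, sepal_width, petal_length, petal_width]])[0])\n    background_tasks.add_task(add_to_database, str(datetime.now()), sepal_length, sepal_width, petal_length, petal_width, prediction)\n    return {\"prediction_int\": prediction}\n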
Call your API a number of times to generate some dummy data in the database.
Create a new data_drift.py
file where we are going to implement the data drifting detection and reporting. Start by adding both the real iris data and your generated dummy data as pandas dataframes.
import pandas as pd\nfrom sklearn import datasets\nreference_data = datasets.load_iris(as_frame=True).frame\ncurrent_data = pd.read_csv('prediction_database.csv')\n
if done correctly you will most likely end up with two dataframes that look like
# reference_data\nsepal length (cm) sepal width (cm) petal length (cm) petal width (cm) target\n0 5.1 3.5 1.4 0.2 0\n1 4.9 3.0 1.4 0.2 0\n...\n148 6.2 3.4 5.4 2.3 2\n149 5.9 3.0 5.1 1.8 2\n[150 rows x 5 columns]\n\n# current_data\ntime sepal_length sepal_width petal_length petal_width prediction\n2022-12-28 17:24:34.045649 1.0 1.0 1.0 1.0 1\n...\n2022-12-28 17:24:34.045649 1.0 1.0 1.0 1.0 1\n[10 rows x 5 columns]\n
Standardize the dataframes such that they have the same column names and drop the time column from the current_data
dataframe.
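One way the standardization could look; this is only a sketch and the exact column names are assumptions that depend on how you wrote your database:
reference_data = reference_data.rename(columns={\n    \"sepal length (cm)\": \"sepal_length\",\n    \"sepal width (cm)\": \"sepal_width\",\n    \"petal length (cm)\": \"petal_length\",\n    \"petal width (cm)\": \"petal_width\",\n    \"target\": \"prediction\",\n})\ncurrent_data = current_data.drop(columns=[\"time\"])  # drop the timestamp before comparing distributions\n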
We are now ready to generate some reports about data drifting:
Try executing the following code:
from evidently.report import Report\nfrom evidently.metric_preset import DataDriftPreset\nreport = Report(metrics=[DataDriftPreset()])\nreport.run(reference_data=reference_data, current_data=current_data)\nreport.save_html('report.html')\n
and open the generated .html
page. What does it say about your data? Has it drifted? Make sure to poke around to understand what the different plots are actually showing.
Data drifting is not the only kind of reporting Evidently can make. We can also get reports on data quality. Try first adding a few NaN
values to your reference data. Secondly, try changing the report to
from evidently.metric_preset import DataDriftPreset, DataQualityPreset\nreport = Report(metrics=[DataDriftPreset(), DataQualityPreset()])\n
and re-run the report. Check out the newly generated report. Again, go over the generated plots and make sure that it picked up on the missing values you just added.
The final report preset we will look at is the TargetDriftPreset
. Target drift means that our model is over/under-predicting certain classes, i.e. in general terms, the distribution of predicted values differs from the ground-truth distribution of targets. Try adding the TargetDriftPreset
to the Report
class and re-run the analysis and inspect the result. Have your targets drifted?
Evidently reports are meant for debugging, exploration and reporting of results. However, as we stated in the beginning, what we are actually interested in is methods for automatically detecting when our data is beginning to drift. For this we will need to look at Tests and TestSuites:
Let's start with a simple test that checks if there are any missing values in our dataset:
from evidently.test_suite import TestSuite\nfrom evidently.tests import TestNumberOfMissingValues\ndata_test = TestSuite(tests=[TestNumberOfMissingValues()])\ndata_test.run(reference_data=reference_data, current_data=current_data)\n
again, we could run data_test.save_html
to get a nice view of the results (feel free to try it out), but we can additionally call the data_test.as_dict()
method, which will give a dict with the test results. Which dictionary key contains the information on whether all tests have passed or not?
Take a look at this colab notebook that contains all tests implemented in Evidently. Pick 5 tests of your choice, where at least 1 fails by default, and implement them as a TestSuite
. Then try changing the arguments of the tests so they better fit your use case and get them all passing.
(Optional) When doing monitoring in practice, we are not always interested in running on all data collected from our API, maybe only the last N
entries or maybe just the observations from the last hour. Since we are already logging the timestamps of when our API is called, we can use them for filtering. Implement a simple filter that either takes an integer n
and returns the last n
entries in our database or some datetime t
that filters away observations earlier than this.
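A small sketch of what such a filter could look like, assuming the database has been loaded into a pandas dataframe with a time column as before:
import datetime\nfrom typing import Optional\nimport pandas as pd\n\ndef filter_entries(df: pd.DataFrame, n: Optional[int] = None, t: Optional[datetime.datetime] = None) -> pd.DataFrame:\n    \"\"\"Return either the last n entries or all entries newer than t.\"\"\"\n    df = df.assign(time=pd.to_datetime(df[\"time\"]))\n    if t is not None:\n        df = df[df[\"time\"] > t]\n    if n is not None:\n        df = df.tail(n)\n    return df\n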
Evidently by default only supports structured data, e.g. tabular data (so does nearly every other framework). Thus, the question becomes how we can extend this to unstructured data such as images or text? The solution is to extract structured features from the data, which we can then run the analysis on.
(Optional) For images the simple solution would be to flatten the images and consider each pixel a feature. However, this does not work in practice, because changes in the individual pixels do not really tell us anything about the image. Instead we should derive some features such as:
These are all numbers that can make up a feature vector for an image. Try doing this yourself, for example by extracting such features from the MNIST and FashionMNIST datasets, and check if you can detect a drift between the two sets (see the sketch below).
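A sketch of such a feature extractor; the specific statistics (average brightness, contrast as the pixel standard deviation, sharpness as the mean absolute vertical gradient) are just one reasonable choice and not the only option:
import numpy as np\nimport pandas as pd\nfrom torchvision import datasets\n\ndef image_features(img) -> dict:\n    arr = np.asarray(img, dtype=np.float32)\n    return {\n        \"brightness\": arr.mean(),\n        \"contrast\": arr.std(),\n        \"sharpness\": np.abs(np.diff(arr, axis=0)).mean(),\n    }\n\n# looping over the full datasets takes a little while\nmnist = datasets.MNIST(root=\"data\", download=True)\nfashion = datasets.FashionMNIST(root=\"data\", download=True)\nreference_data = pd.DataFrame([image_features(img) for img, _ in mnist])\ncurrent_data = pd.DataFrame([image_features(img) for img, _ in fashion])\n# these two dataframes can now be fed to an Evidently report or test suite\n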
(Optional) For text a common approach is to extract some higher-level embedding such as the classical GloVe embedding. Try following this tutorial to understand how drift detection is done on text.
Let's instead take a deep-learning-based approach to doing this. Let's consider the CLIP model, which is normally used to match images and text. For our purpose this is perfect, because we can use the model to get abstract feature embeddings for both images and text:
from PIL import Image\nimport requests\n# requires transformers package: pip install transformers\nfrom transformers import CLIPProcessor, CLIPModel\n\nmodel = CLIPModel.from_pretrained(\"openai/clip-vit-base-patch32\")\nprocessor = CLIPProcessor.from_pretrained(\"openai/clip-vit-base-patch32\")\n\nurl = \"http://images.cocodataset.org/val2017/000000039769.jpg\"\nimage = Image.open(requests.get(url, stream=True).raw)\n\n# set either text=None or images=None when only the other is needed\ninputs = processor(text=[\"a photo of a cat\", \"a photo of a dog\"], images=image, return_tensors=\"pt\", padding=True)\n\nimg_features = model.get_image_features(inputs['pixel_values'])\ntext_features = model.get_text_features(inputs['input_ids'], inputs['attention_mask'])\n
Both img_features
and text_features
are in this case a (512,)
abstract feature embeddings that should be able to tell us something about our data distribution. Try using this method to extract features on two different datasets, like CIFAR10 and SVHN if you want to work with vision, or the IMDB movie review and Amazon review datasets for text. After extracting the features, try running some of the data distribution testing you just learned about.
(Optional) If we have multiple applications and want to run monitoring for each of them, we often also want the monitoring itself to be a deployed application (that only we can access). Implement a /monitoring/
endpoint that does all the reporting we just went through such that you have two endpoints:
http://127.0.0.1:8000/iris_infer/?sepal_length=1.0&sepal_width=1.0&petal_length=1.0&petal_width=1.0 # user endpoint\nhttp://127.0.0.1:8000/iris_monitoring/ # monitoring endpoint\n
Our monitoring endpoint should return an HTML page showing either an Evidently report or a test suite. Try implementing this endpoint. We have implemented a solution in this file if you need help with how to return an HTML page from a FastAPI application.
As a final exercise, we recommend that you try implementing this to run directly in the cloud. You will need to implement this in a container, e.g. as a Cloud Run service, because the data gathering from the endpoint should still be implemented as a background task. For this to work you will need to change the following:
Instead of saving the input to a local file, you should either store it in a GCP bucket or a BigQuery SQL table (the latter is a better solution, but also out of scope for this course)
You can either run the data analysis locally by just pulling the predictions and training data from cloud storage, or alternatively you can deploy the analysis as its own endpoint that can be invoked. For the latter option we recommend requiring authentication.
That ends the module on detection of data drifting, data quality etc. If it has not already been made clear, monitoring of machine learning applications is an extremely hard discipline, because it is not clear-cut when we should actually respond to a feature beginning to drift and when it is probably fine. What kind of rules should be implemented comes down to the individual application. Additionally, the tools presented here are in no way complete and are especially limited in one way: they only consider the marginal distribution of data. Every analysis that we have done has been on the distribution per feature (the marginal distribution); however, as the image below shows, it is possible for data to have drifted to another distribution with the marginals being approximately the same.
There are methods such as Maximum Mean Discrepancy (MMD) tests that are able to do testing on multivariate distributions, which you are free to dive into. In this course we will just always recommend considering multiple features when making decisions regarding your deployed applications.
"},{"location":"s8_monitoring/monitoring/","title":"M26 - System Monitoring","text":""},{"location":"s8_monitoring/monitoring/#monitoring","title":"Monitoring","text":"In this module we are going to look into more classical monitoring of applications. The key concept we are often working with here is called telemetry. Telemetry in general refer to any automatic measurement and wireless transmission of data from our application. It could be numbers such as:
In general there are three different kinds of telemetry we are interested in:
| Name | Description | Example | Purpose |
| --- | --- | --- | --- |
| Metrics | Quantitative measurements of the system, usually numbers aggregated over a period of time. | The number of requests per minute. | Used to get an overview of the system, often by building dashboards. |
| Logs | Textual or structured records generated by applications, giving a detailed account of events, errors, warnings and informational messages that occur during operation of the system. | System logs, error logs. | Essential for diagnosing issues, debugging and auditing: they provide a detailed history of what happened in a system, making it easier to trace the root cause of problems and track the behavior of components over time. |
| Traces | Detailed records of specific transactions or events as they move through a system, typically including the sequence of operations, timing and dependencies between components. | Distributed tracing in a microservices architecture. | Help in understanding the flow of a request or transaction across different components; valuable for identifying bottlenecks, understanding latency and troubleshooting issues related to the flow of data or control. |
"},{"location":"s8_monitoring/monitoring/#local-instrumentator","title":"Local instrumentator","text":"Before we look into the cloud lets at least conceptually understand how a given instance of a app can expose values that we may be interested in monitoring.
The standard framework for exposing metrics is called prometheus. Prometheus is a time series database that is designed to store metrics. It is also designed to be very easy to instrument applications with and it is designed to scale to large amounts of data. The way prometheus works is that it exposes a /metrics
endpoint that can be queried to get the current state of the metrics. The metrics are exposed in a format called prometheus text format.
Start by installing prometheus-fastapi-instrumentator
in python
pip install prometheus-fastapi-instrumentator\n
this will allow us to easily instrument our FastAPI application with prometheus.
Create a simple FastAPI application in a file called app.py
. You can reuse any application from the previous module on APIs. To that file now add the following code:
from prometheus_fastapi_instrumentator import Instrumentator\n\n# your app code here\n\nInstrumentator().instrument(app).expose(app)\n
This will instrument your application with prometheus and expose the metrics on the /metrics
endpoint.
Run the app using uvicorn
server. Make sure that the app exposes the endpoints you expect it too exposes, but make sure you also checkout the /metrics
endpoint.
The metric endpoint exposes multiple /metrics
. Metrics always looks like this:
# TYPE key <type>\nkey value\n
e.g. it is essentially a ditionary of key-value pairs with the added functionality of a <type>
. Look at this page over the different types prometheus metrics can have and try to understand the different metrics being exposed.
Look at the documentation for the prometheus-fastapi-instrumentator
and try to add at least one more metric to your application. Rerun the application and confirm that the new metric is being exposed.
Any cloud system with respect for itself will have some kind of monitoring system. GCP has a service called Monitoring that is designed to monitor all the different services. By default it will monitor a lot of metrics out-of-box. However, the question is if we want to monitor more than the default metrics. The complexity that comes with doing monitoring in the cloud is that we need more than one container. We at least need one container actually running the application that is also exposing the /metrics
endpoint and then we need a another container that is collecting the metrics from the first container and storing them in a database. To implement such system of containers that need to talk to each others we in general need to use a container orchestration system such as Kubernetes. This is out of scope for this course, but we can use a feature of Cloud Run
called sidecar containers
to achieve the same effect. A sidecar container is a container that is running alongside the main container and can be used to do things such as collecting metrics.
Overall we recommend that you just become familiar with the monitoring tab for your cloud run service (see image) above. Try to invoke your service a couple of times and see what happens to the metrics over time.
Try creating a service level objective (SLO). In short a SLO is a target for how well your application should be performing. Click the Create SLO
button and fill it out with what you consider to be a good SLO for your application.
(Optional) To expose our own metrics we need to setup a sidecar container. To do this follow the instructions here. We have setup a simple example that uses fastapi and prometheus that you can find here. After you have correctly setup the sidecar container you should be able to see the metrics in the monitoring tab.
A core problem within monitoring is alert systems. The alert system is in charge of sending out alerts to relevant people when some telemetry or metric we are tracking is not behaving as it should. Alert systems are a subjective choice of when and how many should be send out and in general should be proportional with how important to the of the metric/telemetry. We commonly run into what is referred to the goldielock problem where we want just the right amount of alerts however it is more often the case that we either have
Therefore, setting up proper alert systems can be as challenging as setting up the systems for actually the metrics we want to trigger alerts.
"},{"location":"s8_monitoring/monitoring/#exercises_2","title":"\u2754 Exercises","text":"We are in this exercise going to look at how we can setup automatic alerting such that we get an message every time one of our applications are not behaving as expected.
Go to the Monitoring
service. Then go to Alerting
tab.
Start by setting up an notification channel. A recommend setting up with an email.
Next lets create a policy. Clicking the Add Condition
should bring up a window as below. You are free to setup the condition as you want but the image is one way bo setup an alert that will react to the number of times an cloud function is invoked (actually it measures the amount of log entries from cloud functions).
After adding the condition, add the notification channel you created in one of the earlier steps. Remember to also add some documentation that should be send with the alert to better describe what the alert is actually doing.
When the alert is setup you need to trigger it. If you setup the condition as the image above you just need to invoke the cloud function many times. Here is a small code snippet that you can execute on your laptop to call a cloud function many time (you need to change the url and payload depending on your function):
import time\nimport requests\nurl = 'https://us-central1-dtumlops-335110.cloudfunctions.net/function-2'\npayload = {'message': 'Hello, General Kenobi'}\n\nfor _ in range(1000):\n r = requests.get(url, params=payload)\n
Make sure that you get the alert through the notification channel you setup.
Slides
This module is all about scaling the applications that we are building. We are here going to use a very narrow definition of scaling namely that we want our applications to run faster, however one should note that in general scaling is a much broader term. There are many different ways to scale your applications and we are going to look at three of these related to different tasks machine learning algorithms:
We are going to approach the term scaling from two different angles that both should result in your application running faster. The first approach is levering multiple devices, such as using multiple CPU cores or parallelizing training across multiple GPUs. The second approach is more analytical, were we are actually going to look at how we can design smaller/faster model architectures that runs faster.
It should be noted that this module is specific to working with Pytorch applications. In particular we are going to see how we can both improve base Pytorch code and how to utilize the Pytorch Lightning which we introduced in module M14 on boilerplate to improve the scaling of our applications. If your application is written using another framework we can guarantee that the same techniques in these modules transfers to that framework, but may require you do seek out how to specifically to it.
If you manage to complete all modules in this session, feel free to checkout the extra module on scalable hyperparameter optimization.
Learning objectives
The learning objectives of this session are:
pytorch-lightning
Core Module
One way that deep learning fundamentally changed the way we think about data in machine learning is that more data is always better. This was very much not the case with more traditional machine learning algorithms (random forest, support vector machines etc.) where a pleatau in performance was often reached for a certain amount of data and did not improve if more was added. However, as deep learning models have become deeper and deeper and thereby more and more data hungry performance seems to be ever increasing or at least not reaching a pleatau in the same way as for traditional machine learning.
Image creditAs we are trying to feed more and more data into our models and obvious first question to ask is how to do this in a efficient way. As an general rule of thumb we want the performance bottleneck to be the forward/backward e.g. the actual computation in our neural network and not the data loading. By bottleneck we here refer to the part of our pipeline that is restricting how fast we can process data. If data loading is our bottleneck, then our compute device can sit idle while waiting for data to arrive, which is both inefficient and costly. For example if you are using a cloud provider for training deep learning models, you are paying by the hour per device, and thus not using them fully can be costly in the long run.
In the first set of exercises we are therefore going to focus on distributed data loading i.e. how do load data in parallel to make sure that we always have data ready for our compute devices. We are in the following going to look at what is going on behind the scene when we use Pytorch to parallelize data loading.
"},{"location":"s9_scalable_applications/data_loading/#a-closer-look-on-data-loading","title":"A closer look on Data loading","text":"Before we talk distributed applications it is important to understand the physical layout of a standard CPU (the brain of your computer).
Most modern CPUs is a single chip that consist of multiple cores. Each core can further be divided into threads. In most laptops the core count is 4 and commonly 2 threads per code. This means that the common laptop have 8 threads. The number of threads a compute unit has is important, because that directly corresponds to the number of parallel operations that can be executed i.e. one per thread. In a Python terminal you should be able to get the number of cores in your machine by writing (try it):
import multiprocessing\ncores = multiprocessing.cpu_count()\nprint(f\"Number of cores: {cores}, Number of threads: {2*cores}\")\n
A distributed application is in general any kind of application that parallelizes some or all of it workload. We are in these exercises only focusing on distributed data loading, which happens primarily only on the CPU. In Pytorch
it is easy to parallelize data loading if you are using their dataset/dataloader interface:
from torch.utils.data import Dataset, DataLoader\nclass MyDataset(Dataset):\n def __init__(self, ...):\n # whatever logic is needed to init the data set\n self.data = ...\n\n def __getitem__(self, idx):\n # return one item\n return self.data[idx]\n\ndataset = MyDataset()\ndataloader = Dataloader(\n dataset,\n batch_size=8,\n num_workers=4 # this is the number of threads we want to parallelize workload over\n)\n
Lets take a deep dive into what happens when we request a batch from our dataloader e.g. next(dataloader)
. First we must understand that we have a thread that plays the role of the main and the remaining threads (in the above example we request 4) are called workers. When the dataloader is created, we create this structure and make sure that all threads have a copy of our dataset definition so each can call the __getitem__
method.
Then comes the actual part where we request a batch for data. Assume that we have a batch size of 8 and we do not do any shuffeling. In this step the master thread then distributes the list of requested data points ([0,1,2,3,4,5,6,7]
) to the four worker threads. With 8 indices and 4 workers, each worker will receive 2 indices.
Each worker thread then calls __getitem__
method for all the indices it has received. When all processes are done, the loaded images datapoints gets send back to the master thread collected into a single structure/tensor.
Each arrow is corresponds to a communication between two threads, which is not a free operations. In total to get a single batch (not counting the initial startup cost) in this example we need to do 8 communication operations. This may seem like a small price to pay, but that may not be the case. If the process time of __getitem__
is very low (data is stored in memory, we just need to index to get it) then it does not make sense to use multiprocessing. The computationally saving by doing the look-up operations in parallel is smaller than the communication cost there is between the main thread and the workers. Multiprocessing makes sense when the process time of __getitem__
is high (data is probably stored on the harddrive).
It is this trade-off that we are going to investigate in the exercises.
"},{"location":"s9_scalable_applications/data_loading/#exercises","title":"\u2754 Exercises","text":"This exercise is intended to be done on the labeled faces in the wild (LFW) dataset. The dataset consist images of famous people extracted from the internet. The dataset had been used to drive the field of facial verification, which you can read more about here. We are going imagine that this dataset cannot fit in memory, and your job is therefore to construct a data pipeline that can be parallelized based on loading the raw datafiles (.jpg) at runtime.
Download the dataset and extract to a folder. It does not matter if you choose the non-aligned or aligned version of the dataset.
We provide the lfw_dataset.py
file where we have started the process of defining a data class. Fill out the __init__
, __len__
and __getitem__
. Note that __getitem__
expect that you return a single img
which should be a torch.Tensor
. Loading should be done using PIL Image, as PIL
images is the default input format for torchvision for transforms (for data augmentation).
Make sure that the script runs without any additional arguments
python lfw_dataset.py\n
Visualize a single batch by filling out the codeblock after the first TODO right after defining the dataloader. The visualization should show when launching the script as
python lfw_dataset.py -visualize_batch\n
Hint: this tutorial.
Experiment how the number of workers influences the performance. We have already provide code that will pass over 100 batches from the dataset 5 times and calculate how long time it took, which you can play around with by calling
python lfw_dataset.py -get_timing -num_workers 1\n
Make a errorbar plot with number of workers along the x-axis and the timing along the y-axis. The errorbars should correspond to the standard deviation over the 5 runs. HINT: if it is taking too long to evaluate, measure the time over less batches (set the -batches_to_check
flag). Also if you are not seeing an improvement, try increasing the batch size (since data loading is parallelized per batch).
For certain machines like the Mac with M1 chipset it is necessary to set the multiprocessing_context
flag in the dataloder to \"fork\"
. This essentially tells the dataloader how the worker nodes should be created.
Retry the experiment where you change the data augmentation to be more complex:
lfw_trans = transforms.Compose([\n transforms.RandomAffine(5, (0.1, 0.1), (0.5, 2.0)),\n # add more transforms here\n transforms.ToTensor()\n])\n
by making the augmentation more computationally demanding, it should be easier to get an boost in performance when using multiple workers because the data augmentation is also executed in parallel.
(Optional, requires access to GPU) If your dataset fits in GPU memory it is beneficial to set the pin_memory
flag to True
. By setting this flag we are essentially telling Pytorch that they can lock the data in-place in memory which will make the transfer between the host (CPU) and the device (GPU) faster.
This ends the module on distributed data loading in Pytorch. If you want to go into more details we highly recommend that you read this paper that goes into great details on analyzing on how data loading in Pytorch work and performance benchmarks.
"},{"location":"s9_scalable_applications/distributed_training/","title":"M28 - Distributed Training","text":""},{"location":"s9_scalable_applications/distributed_training/#distributed-training","title":"Distributed Training","text":"In this module we are going to look at distributed training. Distributed training is one of the key ingredients to all the awesome results that deep learning models are producing. For example: Alphafold the highly praised model from Deepmind that seems to have solved protein structure prediction, was trained in a distributed fashion for a few weeks. The training was done on 16 TPUv3s (specialized hardware), which is approximately equal to 100-200 modern GPUs. This means that training Alphafold without distributed training on a single GPU (probably not even possible) would take a couple of years to train! Therefore, it is simply impossible currently to train some of the state-of-the-art (SOTA) models within deep learning currently, without taking advantage of distributed training.
When we talk about distributed training, there are a number of different paradigms that we may use to parallelize our computations
In this module we are going to look at data parallel training, which is the original way of doing parallel training and distributed data parallel training which is an improved version of data parallel. If you want to know more about sharded training which is the newest of the paradigms you can read more about it in this blog post, which describes how sharded can save over 60% of memory used during your training.
Finally, we want to note that for all the exercises in the module you are going to need a multi GPU setup. If you have not already gained access to multi GPU machines on GCP (see the quotas exercises in this module) you will need to find another way of running the exercises. For DTU Students I can recommend checking out this optional module on using the high performance cluster (HPC) where you can get access to multi GPU resources.
"},{"location":"s9_scalable_applications/distributed_training/#data-parallel","title":"Data parallel","text":"While data parallel today in general is seen as obsolete compared to distributed data parallel, we are still going to investigate it a bit since it offers the most simple form of distributed computations in deep learning pipeline.
In the figure below is shown both the forward and backward step in the data parallel paradigm
The steps are the following:
Whenever we try to do forward call e.g. out=model(batch)
we take the batch and divide it equally between all devices. If we have a batch size of N
and M
devices each device will be sent N/M
datapoints.
Afterwards each device receives a copy of the model
e.g. a copy of the weights that currently parametrizes our neural network.
In this step we perform the actual forward pass in parallel. This is the actual steps that can help us scale our training.
Finally we need to send back the output of each replicated model to the primary device.
Similar to the analysis we did of parallel data loading, we cannot always expect that this will actual take less time than doing the forward call on a single GPU. If we are parallelizing over M
devices, we essentially need to do 3xM
communication calls to send batch, model and output between the devices. If the parallel forward call does not outweigh this, then it will take longer.
In addition, we also have the backward path to focus on
As the end of the forward collected the output on the primary device, this is also where the loss is accumulated. Thus, loss gradients are first calculated on the primary device
Next we scatter the gradient to all the workers
The workers then perform a parallel backward pass through their individual model
Finally, we reduce (sum) the gradients from all the workers on the main process such that we can do gradient descend.
One of the big downsides of using data parallel is that all the replicas are destroyed after each backward call. This means that we over and over again need to replicate our model and send it to the devices that are part of the computations.
Even though it seems like a lot of logic is implementing data parallel into your code, in Pytorch we can very simply enable data parallel training by wrapping our model in the nn.DataParallel class.
from torch import nn\nmodel = MyModelClass()\nmodel = nn.DataParallel(model, device_ids=[0, 1]) # data parallel on gpu 0 and 1\npreds = model(input) # same as usual\n
"},{"location":"s9_scalable_applications/distributed_training/#exercises","title":"\u2754 Exercises","text":"Please note that the exercise only makes sense if you have access to multiple GPUs.
Create a new script (call it data_parallel.py
) where you take a copy of model FashionCNN
from the fashion_mnist.py
script. Instantiate the model and wrap torch.nn.DataParallel
around it such that it can be executed in data parallel.
Try to run inference in parallel on multiple devices (pass a batch multiple times and time it) e.g.
import time\nstart = time.time()\nfor _ in range(n_reps):\n out = model(batch)\nend = time.time()\n
Does data parallel decrease the inference time? If no, can you explain why that may be? Try playing around with the batch size, and see if data parallel is more beneficial for larger batch sizes.
It should be clear that there is huge disadvantage of using the data parallel paradigm to scale your applications: the model needs to replicated on each pass (because it is destroyed in the end), which requires a large transfer of data. This is the main problem that distributed data parallel tries to solve.
The two key difference between distributed data parallel and data parallel that we move the model update (the gradient step) to happen on each device in parallel instead of only on the main device. This has the consequence that we do not need to move replicate the model on each step, instead we just keep a local version on each device that we keep updating. The full set of steps (as shown in the figure):
Initialize an exact copy of the model on each device
From disk (or memory) we start by loading data into a section of page-locked host memory per device. Page-locked memory is essentially a way to reverse a piece of a computers memory for a specific transfer that is going to happen over and over again to speed it up. The page-locked regions are loaded with non-overlapping data.
Transfer data from page-locked memory to each device in parallel
Perform forward pass in parallel
Do a all-reduce operation on the gradients. An all-reduce operation is a so call all-to-all operation meaning that all processes send their own gradient to all other processes and also received from all other processes.
Reduce the combined gradient signal from all processes and update the individual model in parallel. Since all processes received the same gradient information, all models will still be in sync.
Thus, in distributed data parallel we here end up only doing a single communication call between all processes, compared to all the communication going on in data parallel. While all-reduce is a more expensive operation that many of the other communication operations that we can do, because we only have to do a single we gain a huge performance boost. Empirically distributed data parallel tends to be 2-3 times faster than data parallel.
However, this performance increase does not come for free. Where we could implement data parallel in a single line in Pytorch, distributed data parallel is much more involving.
"},{"location":"s9_scalable_applications/distributed_training/#exercises_1","title":"\u2754 Exercises","text":"We have provided an example of how to do distributed data parallel training in Pytorch in the two files distributed_example.py
and distributed_example.sh
. You objective is to get a understanding of the necessary components in the script to get this kind of distributed training to work. Try to answer the following questions (HINT: try to Google around):
What is the function of the DDP
wrapper?
What is the function of the DistributedSampler
?
Why is it necessary to call dist.barrier()
before passing a batch into the model?
What does the different environment variables do in the .sh
file
Try to benchmark the runs using 1 and 2 GPUs
The first exercise have hopefully convinced you that it can be quite the trouble writing distributed training applications yourself. Luckily for us, Pytorch-lightning
can take care of this for us such that we do not have to care about the specific details. To get your model training on multiple GPUs you need to change two arguments in the trainer: the accelerator
flag and the gpus
flag. In addition to this, you can read through this guide about any additional steps you may need to do (for many of you, it should just work). Try running your model on multiple GPUs.
Try benchmarking your training using 1 and 2 gpus e.g. try running a couple of epochs and measure how long time it takes. How much of a speedup can you actually get? Why can you not get a speedup of 2?
Inference is task of applying our trained model to some new and unseen data, often called prediction. Thus, scaling inference is different from scaling data loading and training, mainly due to inference normally only using a single data point (or a few). As we can neither parallelize the data loading or parallelize using multiple GPUs (at least not in any efficient way), this is of no use to us when we are doing inference. Secondly, inference is often not something we do on machines that can perform large computations, as most inference today is actually either done on edge devices e.g. mobile phones or in low-cost-low-compute cloud environments. Thus, we need to be smarter about how we scale inference than just throwing more compute at it.
In this module we are going to look at various ways that you can either reduce the size of your model and or make your model faster. Both are important for running inference fast regardless of the setup you are running your model on. We want to note that this is still very much an active area of research and therefore best practices for what to do in a specific situation can change.
"},{"location":"s9_scalable_applications/inference/#choosing-the-right-architecture","title":"Choosing the right architecture","text":"Assume you are starting a completely new project and have to come up with a model architecture for doing this. What is you strategy? The common way to do this, is to look at prior work on similar problems that you are facing and either directly choosing the same architecture or creating some slight variation hereof. This is a great way to get started, but the architecture that you end up choosing may be optimal in terms of performance but not inference speed.
The fact is that not all base architectures are created equal, and a 10K parameter model with one architecture can have significantly different inference speed than another 10K parameter model with another architecture. For example, consider the figure below which compares a number of models from the [timm] package, colored based on their base architecture. The general trend is that the number of images that can be processed by a model per sec (y-axis) is inverse proportional to the number of parameters (x-axis). However, we in general see that convolutional base architectures (conv) are more efficient than transformer (vit) for the same parameter budget.
Image credit"},{"location":"s9_scalable_applications/inference/#exercises","title":"\u2754 Exercises","text":"As dissed in this blogpost the largest increase in inference speed you will see (given some specific hardware) is choosing an efficient model architectures. In the exercises below we are going to investigate the inference speed of different architectures.
Start by checking out this table which contains a list of pretrained weights in torchvision
. Try finding a model that has in the range of 20-30 million parameters.
Write a small script that initializes all the models and does inference with them. It should look something like this
import time

import torch
from torchvision import models

m1 = models.ModelArchitecture1()  # replace with the three architectures you picked
m2 = models.ModelArchitecture2()
m3 = models.ModelArchitecture3()

n_reps = 10
input = torch.randn(100, 3, 256, 256)

for i, m in enumerate([m1, m2, m3]):
    tic = time.time()
    for _ in range(n_reps):
        _ = m(input)
    toc = time.time()
    print(f"Model {i} took: {(toc - tic) / n_reps}")
Do the results make sense? Based on the figure above we would expect efficientnet to be faster than resnet, which in turn should be faster than the transformer-based model. Is this also what you are seeing?
To figure out why one network is more efficient than another, we can try to count the number of operations each network needs to do for inference. An operation here can be defined as a FLOP (floating point operation), which is any mathematical operation (such as +, -, *, /) or assignment that involves floating-point numbers. Luckily for us, someone has already created a Python package for calculating this in Pytorch: ptflops
Install the package
pip install ptflops\n
Try calling the get_model_complexity_info
function from the ptflops
package on the networks from the previous exercise. What are the results?
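A sketch of how this could look; the three torchvision models and the 224x224 input resolution are just example choices, swap in whichever networks you picked:
from ptflops import get_model_complexity_info
from torchvision import models

for model in [models.resnet50(), models.efficientnet_b3(), models.vit_b_16()]:
    macs, params = get_model_complexity_info(
        model, (3, 224, 224), as_strings=True, print_per_layer_stat=False
    )
    print(f"{model.__class__.__name__}: {macs}, {params}")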
In the table from the initial exercise, you could also see the overall performance of each network on the Imagenet-1K dataset. Given this performance, the inference speed, the flops count what network would you choose to use in a production setting? Discuss when choosing one over another should be considered.
Quantization is a technique where all computations are performed with integers instead of floats. We are essentially taking all continuous signals and converting them into discretized signals.
Image creditAs discussed in this blogpost series, while float
(32-bit) is the primary precision used in machine learning because it strikes a good balance between memory consumption, precision and computational requirements, it does not mean that we cannot take advantage of quantization during inference to improve the speed of our model. For instance:
Floating-point computations are slower than integer operations
Recent hardware often has specialized support for doing integer operations
Many neural networks are actually not bottlenecked by how many computations they need to do, but by how fast data can be transferred, i.e. the memory bandwidth and cache of your system are the limiting factors. Working with 8-bit integers instead of 32-bit floats therefore means that we can move data around approximately 4 times as fast.
Storing models in integers instead of floats saves us approximately 75% of the RAM/hard-disk space whenever we save a checkpoint. This is especially useful in relation to deploying models using docker (as you hopefully remember), as it will lower the size of our docker images.
But how do we convert between floats and integers in quantization? In most cases we use linear affine quantization:
$$ x_{int} = \\text{round}\\left( \\frac{x_{float}}{s} + z \\right) $$
where $s$ is a scale and $z$ is the so-called zero point. But how does this relate to doing inference in a neural network? The figure below shows all the conversions that we need to make to our standard inference pipeline to actually do computations in the quantized format.
Image credit"},{"location":"s9_scalable_applications/inference/#exercises_1","title":"\u2754 Exercises","text":"Let's look at how quantized tensors look in Pytorch
Start by creating a tensor that contains both random numbers
Next call the torch.quantize_per_tensor
function on the tensor. What does the quantized tensor look like? How do the values relate to the scale
and zero_point
arguments?
Finally, try to call the .dequantize()
method on the tensor. Do you get a tensor back that is close to what you initially started out with?
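A small sketch of what these steps could look like; the scale and zero_point values are arbitrary choices for illustration:
import torch

x = torch.randn(4)
x_q = torch.quantize_per_tensor(x, scale=0.1, zero_point=128, dtype=torch.quint8)
print(x_q)                # the quantized tensor, with its scale and zero point attached
print(x_q.int_repr())     # the underlying 8-bit integer representation
print(x_q.dequantize())   # back to float, note the small rounding errors
print((x - x_q.dequantize()).abs().max())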
As you hopefully saw in the first exercise, we introduce a number of small rounding errors when doing quantization, and naively we would expect these to accumulate and lead to a much worse model. However, in practice we observe that quantization still works, and we actually have a mathematically sound reason for this. Can you figure out why quantization still works despite all the small rounding errors? HINT: it has to do with the central limit theorem
Let's move on to quantization of our model. Follow this tutorial from Pytorch on how to do quantization. The goal is to construct a model model_fp32
that works on normal floats and a quantized version model_int8
. For simplicity you can just use one of the models from the tutorial.
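If the full tutorial feels like many steps, dynamic quantization is a quick way to get a model_int8 to compare against. A sketch, assuming the model_fp32 variable from the previous exercise (the input shape is an assumption, adjust it to whatever your model expects):
import time
import torch

model_int8 = torch.quantization.quantize_dynamic(
    model_fp32, {torch.nn.Linear}, dtype=torch.qint8  # only the Linear layers are quantized here
)

x = torch.randn(32, 3, 224, 224)  # adjust to the input your model expects
for name, m in [("fp32", model_fp32), ("int8", model_int8)]:
    tic = time.time()
    with torch.inference_mode():
        for _ in range(10):
            _ = m(x)
    print(f"{name}: {(time.time() - tic) / 10:.3f} seconds per forward pass")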
Lets try to benchmark our quantized model and see if all the trouble that we went through actually paid of. Also try to perform the benchmark on the non-quantized model and see if you get a difference. If you do not get an improvement, explain why that may be.
Pruning is another way of reducing the model size and possibly improving the performance of our network. As the figure below illustrates, in pruning we are simply removing weights in our network that we do not consider important for the task at hand. By removing, we here mean that the weight gets set to 0. There are many ways to determine if a weight is important, but the general rule is that the importance of a weight is proportional to its magnitude. This makes intuitive sense, since weights in all linear operations (fully connected or convolutional) are always multiplied onto the incoming value, thus a small weight means a small outgoing activation.
Image credit"},{"location":"s9_scalable_applications/inference/#exercises_2","title":"\u2754 Exercises","text":"We provide a start script that implements the famous LeNet in this file. Open and run it just to make sure that you know the network.
Pytorch already has some pruning methods implemented in its package. Import the prune
module from torch.nn.utils
in the script.
Try to prune the weights of the first convolutional layer by calling
prune.random_unstructured(module_1, name=\"weight\", amount=0.3) # (1)!\n
Try printing the named_parameters
, named_buffers
before and after the module is pruned. Can you explain the difference and what is the connection to the module_1.weight
attribute?
Try pruning the bias of the same module this time using the l1_unstructured
function from the pruning module. Again check the named_parameters
, named_buffers
attributes to make sure you understand the difference between L1 pruning and random unstructured pruning.
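For reference, the call could look something like this (here amount is given as an absolute number of entries instead of a fraction):
prune.l1_unstructured(module_1, name="bias", amount=2)  # removes the 2 bias entries with the smallest absolute value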
Instead of pruning only a single module in the model, let's try pruning the whole model. To do this we just need to iterate over all named_modules
in the model like this:
import torch.nn as nn

for name, module in new_model.named_modules():
    # skip the root module and parameter-free layers, which have no weight to prune
    if isinstance(module, (nn.Conv2d, nn.Linear)):
        prune.l1_unstructured(module, name="weight", amount=0.2)
But what if we wanted to apply different amounts of pruning to different layers? Implement a pruning scheme where, for example, convolutional layers are pruned with amount=0.2 and linear layers are pruned with amount=0.4 (a sketch of such a scheme is shown below).
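A sketch of such a scheme, assuming the example amounts above:
import torch.nn as nn

for name, module in new_model.named_modules():
    if isinstance(module, nn.Conv2d):
        prune.l1_unstructured(module, name="weight", amount=0.2)
    elif isinstance(module, nn.Linear):
        prune.l1_unstructured(module, name="weight", amount=0.4)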
Run print(dict(new_model.named_buffers()).keys())
after the pruning to confirm that all weights have been correctly pruned.
The pruning we have looked at until now has only been local in nature, i.e. we have applied the pruning independently for each layer, not accounting globally for how much we should actually prune. As you may realize, this can quickly lead to a network that is pruned too much. Instead, the more common approach is to prune globally, where we remove the smallest X
amount of connections:
Start by creating a tuple over all the weights with the following format
parameters_to_prune = (\n (model.conv1, 'weight'),\n # fill in the rest of the modules yourself\n (model.fc3, 'weight'),\n)\n
The tuple needs to have length 5. Challenge: Can you construct the tuple using for
loops, such that the code works for arbitrary size networks?
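One way to solve the challenge is to build the tuple programmatically. A sketch, assuming you only want to prune the convolutional and linear layers:
import torch.nn as nn

parameters_to_prune = tuple(
    (module, "weight")
    for module in model.modules()
    if isinstance(module, (nn.Conv2d, nn.Linear))
)
print(len(parameters_to_prune))  # 5 for LeNet: conv1, conv2, fc1, fc2, fc3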
Next prune using the global_unstructured
function to globally prune the tuple of parameters
prune.global_unstructured(\n parameters_to_prune,\n pruning_method=prune.L1Unstructured,\n amount=0.2,\n)\n
Check that the amount that has been pruned is actually equal to the 20% specified in the pruning. We provide the following function that for a given submodule (for example model.conv1
) computes the amount of pruned weights
def check_prune_level(module: nn.Module):\n sparsity_level = 100 * float(torch.sum(module.weight == 0) / module.weight.numel())\n print(f\"Sparsity level of module {sparsity_level}\")\n
With a pruned network we really want to see if all our effort actually ended up with a network that is faster and/or smaller in memory. Do the following to the globally pruned network from the previous exercises:
First we need to make the pruning of our network permanent. Right now it is only semi-permanent as we are still keeping a copy of the original weights in memory. Make the change permanent by calling prune.remove
on every pruned module in the model. Hint: iterate over the parameters_to_prune
tuple.
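A sketch of how this could be done, reusing the parameters_to_prune tuple from before:
for module, name in parameters_to_prune:
    prune.remove(module, name)  # folds the mask into the weight and removes weight_orig / weight_mask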
Next try to measure the time of a single inference (repeated 100 times) for both the pruned and non-pruned network
import time\ntic = time.time()\nfor _ in range(100):\n _ = network(torch.randn(100, 1, 28, 28))\ntoc = time.time()\nprint(toc - tic)\n
Is the pruned network actually faster? If not can you explain why?
Next let's measure the size of our network (called pruned_network
) and a freshly initialized network (called network
):
torch.save(pruned_network.state_dict(), 'pruned_network.pt')\ntorch.save(network.state_dict(), 'network.pt')\n
Look up the size of each file. Is the pruned network actually smaller? If not, can you explain why?
Repeat the last exercise, but this time start by converting all pruned weights to sparse format first by calling the .to_sparse()
method on each pruned weight. Is the saved model smaller now?
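A sketch of how the saving could look; which tensors to convert is up to you, here only the weight matrices are converted:
state_dict = pruned_network.state_dict()
sparse_state_dict = {
    key: tensor.to_sparse() if key.endswith("weight") else tensor
    for key, tensor in state_dict.items()
}
torch.save(sparse_state_dict, "pruned_network_sparse.pt")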
This ends the exercises on pruning. As you probably realized in the last couple of exercises, pruning does not guarantee speedups out of the box. This is because linear operations in Pytorch do not handle sparse structures out of the box. To actually get speedups we would need to dive deep into sparse tensor operations, which again do not guarantee a speedup, because the performance of these operations depends on the sparsity structure of the pruned weights. Investigating this is out of scope for these exercises, but we highly recommend checking it out if you are interested in sparse networks.
"},{"location":"s9_scalable_applications/inference/#knowledge-distillation","title":"Knowledge distillation","text":"Knowledge distillation is somewhat similar to pruning, in the sense that it tries to find a smaller model that can perform equally well as a large model, however it does so in a completely different way. Knowledge distillation is a model compression technique that builds on the work of Bucila et al. in which we try do distill/compress the knowledge of a large complex model (also called the teacher model) into a simpler model (also called the student model).
The best known example of this is the DistilBERT model. The DistilBERT model is a smaller version of the large natural-language processing model BERT, which achieves 97% of the performance of BERT while having 40% fewer weights and being 60% faster. You can see in the figure below how it is much smaller in size compared to other models developed at the same time.
Image creditKnowledge distillation works by assuming we have a big teacher model that is already performing well and that we want to compress. By running our training set through the large model we get a softmax distribution for each and every training sample. The goal of the student is to both match the original labels of the training data and to match the softmax distribution of the teacher model. The intuition behind doing this is that the teacher model needs to be more complex to learn the complex inter-class relationships from just (one-hot) labels. The student, on the other hand, is directly fed softmax distributions from the teacher that explicitly encode these inter-class relationships, and thus does not need the same capacity to learn the same as the teacher.
Image credit"},{"location":"s9_scalable_applications/inference/#exercises_3","title":"\u2754 Exercises","text":"Let's try implementing model distillation ourselves. We are going to see if we can achieve this on the cifar10 dataset. Do note that the exercises below can take quite a long time to finish because they involve training multiple networks and therefore involve some waiting.
Start by installing the transformers
and datasets
packages from Huggingface
pip install transformers\npip install datasets\n
which we are going to use to download the cifar10 dataset and a teacher model.
Next download the cifar10 dataset
from datasets import load_dataset\ndataset = load_dataset(\"cifar10\")\n
Next let's initialize our teacher model. For this we consider a large transformer-based model:
from transformers import AutoFeatureExtractor, AutoModelForImageClassification\nextractor = AutoFeatureExtractor.from_pretrained(\"aaraki/vit-base-patch16-224-in21k-finetuned-cifar10\")\nmodel = AutoModelForImageClassification.from_pretrained(\"aaraki/vit-base-patch16-224-in21k-finetuned-cifar10\")\n
To get the logits (un-normalized softmax scores) from our teacher model for a single datapoint from the training dataset you would extract it like this:
sample_img = dataset['train'][0]['img']\npreprocessed_img = extractor(dataset['train'][0]['img'], return_tensors='pt')\noutput = model(**preprocessed_img)\nprint(output.logits)\n# tensor([[ 3.3682, -0.3160, -0.2798, -0.5006, -0.5529, -0.5625, -0.6144, -0.4671, 0.2807, -0.3066]])\n
Repeat this process for the whole training dataset and store the result somewhere.
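A sketch of how the logits could be collected; the batch size and file name are arbitrary choices, and on CPU this will take a while for all 50 000 training images:
import torch

model.eval()
teacher_logits = []
with torch.inference_mode():
    for i in range(0, len(dataset["train"]), 64):
        batch = dataset["train"][i : i + 64]          # dict with keys "img" and "label"
        inputs = extractor(batch["img"], return_tensors="pt")
        teacher_logits.append(model(**inputs).logits)
torch.save(torch.cat(teacher_logits), "teacher_logits.pt")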
Implement a simple convolutional model. You can create a custom one yourself or use a small one from torchvision
.
Train the model on cifar10 to convergence, so you have a base result on how the model is performing.
Redo the training, but this time add knowledge distillation to your training objective. It should look like this:
for batch in dataset:
    # ...
    img, target, teacher_logits = batch
    preds = model(img)
    loss = torch.nn.functional.cross_entropy(preds, target)
    # cross_entropy accepts class probabilities as the target, so convert the teacher logits first
    loss_teacher = torch.nn.functional.cross_entropy(preds, teacher_logits.softmax(dim=-1))
    loss = loss + loss_teacher
    loss.backward()
    # ...
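Note that the loss above is a simplified version: the classic formulation from Hinton et al. softens both distributions with a temperature and uses a KL divergence for the distillation term, something like:
import torch.nn.functional as F

T = 2.0  # temperature, a tunable hyperparameter
loss_teacher = F.kl_div(
    F.log_softmax(preds / T, dim=-1),
    F.softmax(teacher_logits / T, dim=-1),
    reduction="batchmean",
) * T * T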
Compare the final performance obtained with and without knowledge distillation. Did the performance improve or not?
This ends the module on scaling inference in machine learning models.
"},{"location":"tools/","title":"Tools","text":"Just a collection of tools and scripts for running the course.
"}]} \ No newline at end of file +{"config":{"lang":["en"],"separator":"[\\s\\-]+","pipeline":["stopWordFilter"]},"docs":[{"location":"","title":"Home","text":"Machine Learning Operations
Repository for course 02476 at DTU.
Checkout the homepage!
"},{"location":"#i-course-information","title":"\u2139\ufe0f Course information","text":"
Course responsible
Postdoc Nicki Skafte Detlefsen, nsde@dtu.dk
Professor S\u00f8ren Hauberg, sohau@dtu.dk
5 ECTS (European Credit Transfer System), corresponding to 140 hours of work
Recommended prerequisites: DTU course 02456 (Deep Learning) or experience with the following topics:
General understanding of machine learning (datasets, probability, classifiers, overfitting etc.)
Start by cloning or downloading this repository
git clone https://github.com/SkafteNicki/dtu_mlops\n
If you do not have git installed (yet) we will touch upon it in the course. The folder will contain all the exercise material for this course and lectures. Additionally, you should join our Slack channel which we use for communication. The link may be expired, write to me.
"},{"location":"#course-organization","title":"\ud83d\udcc2 Course organization","text":"We highly recommend that when going through the material you use the homepage which is the corresponding Github pages version of this repository that is more nicely rendered, that also includes some special HTML magic provided by Material for MkDocs.
The course is divided into sessions, denoted by capital S, and modules, denoted by capital M. A session corresponds to a full day of work if you are following the course, meaning approximately 9 hours of work. Each session (S) corresponds to a topic within MLOps and consists of multiple modules (M) that each cover a tool within the session.
Importantly we differ between core modules and optional modules. Core modules will be marked by
Core Module
at the top of their corresponding page. Core modules are important to go through to be able to pass the course. You are highly recommended to still do the optional modules.
"},{"location":"#mlops-what-is-it","title":"\ud83c\udd92 MLOps: What is it?","text":"Machine Learning Operations (MLOps) is a rather new field that has seen its uprise as machine learning and particularly deep learning has become a widely available technology. The term itself is a compound of \"machine learning\" and \"operations\" and covers everything that has to do with the management of the production ML lifecycle.
The lifecycle of production ML can largely be divided into three phases:
Design: The initial phase starts with an investigation of the problem. Based on this analysis, several requirements can be prioritized for what we want our future model to do. Since machine learning requires data to be trained, we also investigate in this step what data we have and if we need to source it in some other way.
Model development: Based on the design phase we can begin to conjure some machine learning algorithms to solve our problems. As always, the initial step often involves doing some data analysis to make sure that our model is learning the signal that we want it to learn. Secondly, is the machine learning engineering phase, where the particular model architecture is chosen. Finally, we also need to do validation and testing to make sure that our model is generalizing well.
Operations: Based on the model development phase, we now have a model that we want to use. The operations are where create an automatic pipeline that makes sure that whenever we make changes to our codebase they get automatically incorporated into our model, such that we do not slow down production. Equally important is the ongoing monitoring of already deployed models to make sure that they behave exactly as we specified them.
It is important to note that the three steps are a cycle, meaning that when you have successfully deployed a machine learning model that is not the end of it. Your initial requirements may change, forcing you to revisit the design phase. Some new algorithms may show promising results, so you revisit the model development phase to implement this. Finally, you may try to cut the cost of running your model in production, making you revisit the operations phase, and trying to optimize some steps.
The focus in this course is particularly on the Operations part of MLOps as this is what many data scientists are missing in their toolbox to take all the knowledge they have about data processing and model development into a production setting.
"},{"location":"#learning-objectives","title":"\u2754 Learning objectives","text":"General course objective
Introduce the student to a number of coding practices that will help them organization, scale, monitor and deploy machine learning models either in a research or production setting. To provide hands-on experience with a number of frameworks, both local and in the cloud, for doing large scale machine learning models.
This includes:
Additional reading resources (in no particular order):
Ref 1 Introduction blog post for those who have never heard about MLOps and want to get an overview.
Ref 2 Great document from Google about the different levels of MLOps.
Ref 3 Another introduction to the principles of MLOps and the different stages of MLOps.
Ref 4 Great paper about the technical depth in machine learning.
Ref 5 Interview study that uncovers many of the pain points that ML engineers go through when doing MLOps.
Other courses with content similar to this:
Made with ML. Great online MLOps course that also covers additional topics on the foundations of working with ML.
Full stack deep learning. Another MLOps online course going through the whole developer pipeline.
MLOps Zoomcamp. MLOps online course that includes many of the same topics.
If you want to contribute to the course, we are happy to have you! Anything from fixing typos to adding new content is welcome. For building the course material locally, it is a simple two-step process:
pip install -r requirements.txt\nmkdocs serve\n
Which will start a local server that you can access at localhost:8000
and will automatically update when you make changes to the course material. When you have something that you want to contribute, please make a pull request.
I highly value open source, and the content of this course is therefore free to use under the Apache 2.0 license. If you use parts of this course in your work, please cite using:
@misc{skafte_mlops,\n author = {Nicki Skafte Detlefsen},\n title = {Machine Learning Operations},\n howpublished = {\\url{https://github.com/SkafteNicki/dtu_mlops}},\n year = {2024}\n}\n
"},{"location":"challenges/","title":"Challenges","text":"If you have managed to go through all other material, congratulations, you are already a good way to becoming an MLOps engineer with a great overview of tools, concepts and techniques within the field. Below are listed some technical hard problems regarding MLOps. These are meant as inspiration to get you to deep dive more into using all the cloud services that gcp
offers. You are also free to continue work on your project.
Currently testing takes place in Github, but it should come as no surprise that gcp
can also take care of this. Implementing testing on gcp
. This blogpost can probably help.
In the lectures we setup cloud build to automatically build a docker container for training whenever we pushed code to our github repository. However, we also setup CI testing in github. If tests are failing on github the building of the docker image is still being done, essentially wasting our precious cloud credit. Setup a system so cloud building only commence when all tests are passing.
Authenticating between gcp
, wandb
and dvc
can be tricky to do in a secure way. Figure out how to use the Secret Manager in gcp
to pass secrets e.g. API keys during the build process of docker images. This page may help
We have already done deployment through Cloud Functions
. The native extension to cloud functions is the service Cloud Run
which allows for more than just code snippets to be deployed. Checkout this service and try to deploy a container using it.
All deployments we have done in the course have been serverless, because it makes it easier for us to focus on the actual application we are trying to deploy instead of focusing on server management. That said, going through the trouble of using a server orchestrator yourself can be worth it in many situations. Figure out how to use kubernetes in gcp
. It will involve getting familiar with the kubernetes API and probably also kubeflow for managing pipelines on the server.
Vertex AI is the newest ML service on gcp
. It combines many of the features of the AI platform service you have already used with the AutoML service. Figure out how to use Vertex AI service to either train a custom model or use their AutoML feature. This blogpost can be a good place to start.
If you want different services to be able to talk to each other the correct way is to setup a system using Pub and Sub (publish and subscription) service in gcp
. Essentially it allows a service to publish a message and other services to subscribe and react to it. For example the AI platform could publish a message every time a model was done training and cloud build could subscribe to that, automatically staring to build a docker image using the trained model. Investigate Pub and Sub and try to make two services talk to each other.
In the deployment exercises you probably looked at least once on the logs. We can automate what we do with the logs using the Logs Explorer service, which collects all logs from all services that you are using. Setup Logs routing for one of your deployed services to your cloud storage. Afterwards setup a VM that consumes the logs and accumulate them.
For further questions, please contact Nicki.
"},{"location":"faq/#is-it-possible-to-attend-the-course-fully-online","title":"Is it possible to attend the course fully online \u2754","text":"Mostly yes. All exercises are provided online and lectures will be recorded and streamed. However, do note that
Overall we try to support flexible learning as much as possible with some limitations.
"},{"location":"faq/#what-are-the-prerequisites-for-taking-this-course","title":"What are the prerequisites for taking this course \u2754","text":"We recommend that you have a basic understanding of machine learning concepts such as what a dataset is, what probabilities are, what a classifier is, what overfitting means etc. This corresponds to the curriculum covered in course 02450. The actual focus of the course is not on machine learning models, but we will be using these basic concepts throughout the exercises.
Additionally, we recommend basic knowledge about deep learning and how to code in Pytorch, corresponding to the curriculum covered in 02456. From prior experience, we know that not all students have gained knowledge about deep learning models before this course, and we will be covering the basics of how to code in PyTorch in one of the first modules of the course to get everyone up to speed.
"},{"location":"faq/#i-will-be-missing-x-days-of-the-course-will-that-be-a-problem","title":"I will be missing X days of the course, will that be a problem \u2754","text":"Depends. The course is fairly intensive, with most students working from 9-17 every day. If you already know that you will be missing X days of the course, then I highly recommend that you go through some of the first sessions before the course starts to give yourself a bit of breathing room. If you are not able to do so, please be aware that an additional effort may be needed from you to keep up with your fellow students.
"},{"location":"faq/#how-many-should-we-be-in-a-group-for-the-projects","title":"How many should we be in a group for the projects \u2754","text":"Between 3 and 5. The projects are designed to be done in groups meaning that we intentionally make them too big for one person to do alone. Luckily, a lot of the work that you need to do can be done in parallel, so it is not as bad as it sounds.
"},{"location":"faq/#when-will-the-exam-take-place","title":"When will the exam take place \u2754","text":"The oral part of the exam, which is a small project demo, always falls on the last day of the course. For January 2024, this means the 19th. The written part which is a small project report, should be handed in at midnight on the final course day.
"},{"location":"faq/#where-can-i-find-information-regarding-the-exam","title":"Where can I find information regarding the exam \u2754","text":"Look at the bottom of this page. Details will be updated as we get closer to the exam date.
"},{"location":"faq/#can-i-use-chatgpt-or-similar-tools-for-the-exercises-project-exam-report-coding-writing","title":"Can I use ChatGPT or similar tools for the exercises, project, exam report (coding + writing) \u2754","text":"Yes, yes, and yes, but remember that its a tool and you need to validate the output before using it. We would prefer for the exam report that you formulate the answers in your own words because it is intended for you do describe what you have been doing in your project. The I in LLM stands for intelligence.
"},{"location":"faq/#i-am-a-foreign-student-and-my-home-university-doesnt-accept-passnot-pass-what-can-i-do","title":"I am a foreign student and my home university doesn't accept pass/not pass, what can I do \u2754","text":"We can give a grade on the Danish 7-point grading scale for foreign students who need it, where their home university does not accept pass/no-pass. You need to contact the course responsible Nicki within the first week of the course to request this. Secondly, make sure to also inform us about it during the oral part of the exam because we need to ask you additional questions to be able to give an exact grade.
"},{"location":"faq/#i-am-a-euroteq-student-any-special-rules-for-me","title":"I am a EuroTEQ student, any special rules for me \u2754","text":"You will be allowed to attend the oral part of the exam online and we will provide a special Slack channel for you, trying to make sure that you get the same help as students from DTU who can attend the course on campus.
"},{"location":"overview/","title":"Summary of course content","text":"There are a lot of moving parts in this course, so it may be hard to understand how it all fits together. This page provides a summary of the frameworks in this course e.g. the stack of tools used. In the figure below we have provided an overview on how the different tools of the course interacts with each other. The table after the figure provides a short description of each of the parts.
The MLOps stack in the course. This is just an example of one stack, and depending on your use case you may want to use a different stack of tools that better fits your needs. Regardless of the stack, the principles of MLOps are the same. Framework Description Pytorch is the backbone of our code, it provides the computational engine and the data structures that we need to define our data structures. Pytorch lightning is a framework that provides a high-level interface to Pytorch. It provides a lot of functionality that we need to train our models, such as logging, checkpointing, early stopping, etc. such that we do not have to implement it ourselves. It also allows us to scale our models to multiple GPUs and multiple nodes. We control the dependencies and python interpreter using Conda that enables us to construct reproducible virtual environments For configuring our experiments we use Hydra that allows us to define a hierarchical configuration structure config files Using Weights and Bias allows us to track and log any values and hyperparameters for our experiments Whenever we run into performance bottlenecks with our code we can use the Profiler to find the cause of the bottleneck When we run into bugs in our code we can use the Debugger to find the cause of the bug For organizing our code and creating templates we can use Cookiecutter Docker is a tool that allows us to create a container that contains all the dependencies and code that we need to run our code For controlling the versions of our data and synchronization between local and remote data storage, we can use DVC that makes this process easy For version control of our code we use Git (in complement with Github) that allows multiple developers to work together on a shared codebase We can use Pytest to write unit tests for our code, to make sure that new changes to the code does break the code base For linting our code and keeping a consistent coding style we can use tools such as Pylint and Flake8 that checks our code for common mistakes and style issues For running our unit tests and other checks on our code in a continues manner e.g. 
after we commit and push our code we can use Github actions that automate this process Using Cloud build we can automate the process of building our docker images and pushing them to our container registry Container registry is a service that allows us to store our docker images for later use by other services For storing our data and trained models we can use Cloud storage that provides a scalable and secure storage solution For general compute tasks we can use Compute engine that provides a scalable and secure compute solution For training our experiments in a easy and scalable manner we can use Vertex AI For creating a REST API for our model we can use FastAPI that provides a high-level interface for creating APIs For simple deployments of our code we can use Cloud functions that allows us to run our code in response to events through simple python functions For more complex deployments of our code we can use Cloud run that allows us to run our code in response to events through docker containers Cloud monitoring gives us the tools to keep track of important logs and errors from the other cloud services For monitoring our deployed model is experiencing any drift we can use Evidently AI that provides a framework and dashboard for monitoring drift For monitoring the telemetry of our deployed model we can use OpenTelemetry that provides a standard for collecting and exporting telemetry data"},{"location":"projects/","title":"Project work","text":"Slides
Approximately 1/3 of the course time is dedicated to doing project work. The projects will serve as the basis of your exam. In the project, you will essentially re-apply everything that you learn throughout the course to a self chosen project. The overall goals with the project is:
In the projects you are free to work on whatever problem that you want. That said, we have a specific requirement, that you need to incorporate some third-party framework into your project. If you want inspiration for projects, here are some examples
Classification of tweets
Translating from English to German
Classification of scientific papers
Classification of rice types from images
We hope most students will be able to form groups by themselves. Expected group size is between 3 and 5. If you are not able to form a group, please make sure to post in the #looking-for-group
channel on Slack or make sure to be present on the 4th day of the course (the day before the project work starts) where we will help students that have not found a group yet.
We strive to keep the tools thought in this course as open-source as possible. The great thing about the open-source community is that whatever problem you are working on, there is probably some package out there that can get you at least 10% of the way. For the project, we want to enforce this point and you are required to include some third-party package, that is neither Pytorch or one of the tools already covered in the course, into your project.
If you have no idea what framework to include, the Pytorch ecosystem is a great place for finding open-source frameworks that can help you accelerate your own projects where Pytorch is the backengine. All tools in the ecosystem should work greatly together with Pytorch. However, it is important to note that the ecosystem is not a complete list of all the awesome packages that exist to extend the functionality of Pytorch. If you are still missing inspiration for frameworks to use, we highly recommend these three that have been used in previous years of the course:
PyTorch Image Models. PyTorch Image Models (also known as TIMM) is the absolutely most used computer vision package (maybe except for torchvision
). It contains models, scripts and pre trained for a lot of state-of-the-art image models within computer vision.
Transformers. The Transformers repository from the Huggingface group focuses on state-of-the-art Natural Language Processing (NLP). It provides many pre-trained model to perform tasks on texts such as classification, information extraction, question answering, summarization, translation, text generation, etc in 100+ languages. Its aim is to make cutting-edge NLP easier to use for everyone.
Pytorch-Geometric. PyTorch Geometric (PyG) is a geometric deep learning. It consists of various methods for deep learning on graphs and other irregular structures, also known as geometric deep learning, from a variety of published papers.
Each project day is fully dedicated to project work, except for maybe external inspirational lectures in the morning. The group decides exactly where they want to work on the project, how they want to work on the project, how do distribute the workload etc. We actually encourage strongly to parallelize work during the project, because there are a lot of tasks to do, but it it is important that all group members at least have some understanding of the whole project.
Remember that the focus of the project work is not to demonstrate that you can work with the biggest and baddest deep learning model, but instead that you show that you can incorporate the tools that are taught throughout the course in a meaningful way.
Also note that the project is not expected to be very large in scope. It may simply be that you want to train X model on Y data. You will approximately be given 4 full days to work on the project. It is better that you start out with a smaller project and then add complexity along the way if you have time.
"},{"location":"projects/#day-1","title":"Day 1","text":"The first project days is all about getting started on the projects and formulating exactly what you want to work on as a group.
Start by brainstorming projects! Try to figure out exactly what you want to work with and begin to investigate what third party package that can support the project.
When you have come up with an idea, write a project description. The description is the delivery for today and should be at least 300 words. Try to answer the following questions in the description:
(Optional) If you want to think more about the product design of your project, feel free to fill out the ML canvas (or part of it). You can read more about the different fields on canvas here.
After having done the product description, you can start on the actual coding of the project. In the next section, a to-do list is attached that summaries what we are doing in the course. You are NOT expected to fulfill all bullet points from week 1 today.
The project description will serve as an guideline for us at the exam that you have somewhat reached the goals that you set out to do. By the end of the day, you should commit your project description to the README.md
file belonging to your project repository. If you filled out the ML canvas, feel free to include that as part of the README.md
file. Also remember to commit whatever you have done on the project until now. When you have done this, go to DTU Learn and hand-in (as a group) the link to your github repository as an assignment.
We will briefly (before next Monday) look over your github repository and project description to check that everything is fine. If we have any questions/concerns we will contact you.
"},{"location":"projects/#day-2","title":"Day 2","text":"The goal for today is simply to continue working on your project. Start with bullet points in the checklist from week 1 and continue with bullet points for week 2.
"},{"location":"projects/#day-3","title":"Day 3","text":"Continue working on your project, today you should hopefully focus on the bullet points in the checklist from week 2. There is no delivery for this week, but make sure that you have committed all your progress at the end of the day. We will again briefly look over the repositories and will reach out to your group if we are worried about the progression of your project.
"},{"location":"projects/#day-4","title":"Day 4","text":"We have now entered the final week of the course and the second last project day. You are most likely continuing with bullet points from week 2, but should hopefully begin to look at the bullet points from week 3 today. These are in general much more complex, so we recommend looking at them until you have completed most from week 2. We also recommend that you being to fill our report template.
"},{"location":"projects/#day-5","title":"Day 5","text":"Today you are finishing your project. We recommend that you start by creating a architechtual overview of your project similar to this figure. I recommend using draw.io for creating this kind of diagram, but feel free to use any tool you like. Else you should just continue working on your project, checking of as many bullet points as possible. Finally, you should also prepare yourself for the exam tomorrow.
"},{"location":"projects/#project-checklist","title":"Project checklist","text":"Please note that all the lists are exhaustive meaning that I do not expect you to have completed very point on the checklist for the exam.
"},{"location":"projects/#week-1","title":"Week 1","text":"make_dataset.py
file such that it downloads whatever data you need andrequirements.txt
file with whatever dependencies that you are usingpep8
) while doing the projectThe exam consist of a written and oral element, and both contributes to the overall evaluation if you should pass or not pass the course.
For the written part of the exam we provide an template folder called reports. As the first task you should copy the folder and all its content to your project repository. Then, you jobs is to fill out the README.md
file which contains the report template. The file itself contains instructions on how to fill it out and instructions on using the included report.py
file. You will hand-in the template by simple including it in your project repository. By midnight on the 20/1 we will scrape it automatically, and changes after this point are therefore not registered.
For the oral part of the exam you will be given a time slot where you have to show up for 5-7 min and give a very short demo of your project. What we are interested in seeing is essentially a live demo of your deployed application/project. We will possibly also ask questions regarding the overall curriculum of the course. Importantly, you should have your deployed application, the github repository with your project code, W&B account and your GCP account ready before you enter the exam so we can quickly jump around. We will send out an the time slots during the last week.
"},{"location":"timeplan/","title":"Timeplan","text":"Slides
The course is organised into exercise (2/3 of the course) days and project days (1/3 of the course).
Exercise days start at 9:00 in the morning with an lecture (15-30 min) that will give some context about at least one of the topics of that day. Additionally, previous days exercises may shortly be touched upon. The remaining of the day will be spend on solving exercises either individually or in small groups. For some people the exercises may be fast to do and for others it will take the hole day. We will provide help throughout the day. We will try to answer questions on slack but help with be priorities to students physically on campus.
Project days are intended for project work and you are therefore responsible for making an agreement with your group when and where you are going to work. The first project days there will be a lecture at 9:00 with project information. Other project days we may also start the day with an external lecture, which we highly recommend that you participate in. During each project day we will have office hours where you can ask questions for the project.
Below is an overall timeplan for each day, including the presentation topic of the day and the frameworks that you will be using in the exercises.
Legend: \ud83d\udcdd Slides, \ud83c\udfa5 Recording.
Note
Current dates listed below are for January 2024 version of the course. The lectures and recordings are currently from January 2023 version of the course. Please note that for January 2024, the first week starts on a Tuesday and ends on a Saturday.
"},{"location":"timeplan/#week-1","title":"Week 1","text":"In the first week you will be introduced to a number of development practices for organizing and developing code, especially with a focus on making everything reproducible.
Date Day Presentation topic Frameworks Format 2/1 Tuesday Deep learning software \ud83d\udcdd \ud83c\udfa5(2023) \ud83c\udfa5(2024) Terminal, Conda, IDE, Pytorch Exercises 3/1 Wednesday MLOps: what is it? \ud83d\udcdd.pdf) \ud83c\udfa5(2023) \ud83c\udfa5(2023) Git, CookieCutter, Pep8, DVC Exercises 4/1 Thursday Reproducibility \ud83d\udcdd \ud83c\udfa5(2023) \ud83c\udfa5(2024) Docker, Hydra Exercises 5/1 Friday Debugging \ud83d\udcdd \ud83c\udfa5(2023) \ud83c\udfa5(2024) Debugger, Profiler, Wandb, Lightning Exercises 6/1 Saturday Pytorch ecosystem \ud83d\udcdd \ud83c\udfa5(2023) \ud83c\udfa5(2024) - Projects"},{"location":"timeplan/#week-2","title":"Week 2","text":"The second week is about automatization and the cloud. Automatization will help use making sure that our code does not break when we make changes to it. The cloud will help us scale up our applications and we learn how to use different services to help develop a full machine learning pipeline.
Date Day Presentation topic Frameworks Format 8/1 Monday Continuous Integration \ud83d\udcdd \ud83c\udfa5 Pytest, Github actions, Pre-commit, CML Exercises 9/1 Tuesday The Cloud \ud83d\udcdd \ud83c\udfa5 GCP Engine, Bucket, Container registry, Vertex AI Exercises 10/1 Wednesday Deployment \ud83d\udcdd \ud83c\udfa5 FastAPI, Torchservce, GCP Functions, Run Exercises 11/1 Thursday No lecture \ud83c\udfa5 - Projects 12/1 Friday No lecture \ud83c\udfa5 - Projects"},{"location":"timeplan/#week-3","title":"Week 3","text":"For the final week we look into advance topics such as monitoring and scaling of applications. Monitoring is especially important for the longivity for the applications that we develop, that we actually can deploy them either locally or in the cloud and that we have the tools to monitor how they behave over time. Scaling of applications is an important topic if we ever want our applications to be used by many people at the same time.
Date Day Presentation topic Frameworks Format 15/1 Monday Monitoring \ud83d\udcdd \ud83c\udfa5 Evidently AI, OpenTelemetry, Signoz Exercises 16/1 Tuesday Scalable applications \ud83d\udcdd \ud83c\udfa5 Pytorch, Lightning Exercises 17/1 Wednesday - - Projects 18/1 Thursday - - Projects 19/1 Friday - - Exam"},{"location":"reports/","title":"Exam template for 02476 Machine Learning Operations","text":"This is the report template for the exam. Please only remove the text formatted as with three dashes in front and behind like:
--- question 1 fill here ---
where you instead should add your answers. Any other changes may have unwanted consequences when your report is auto generated in the end of the course. For questions where you are asked to include images, start by adding the image to the figures
subfolder (please only use .png
, .jpg
or .jpeg
) and then add the following code in your answer:
![my_image](figures/<image>.<extension>)\n
In addition to this markdown file, we also provide the report.py
script that provides two utility functions:
Running:
python report.py html\n
will generate an .html
page of your report. After deadline for answering this template, we will autoscrape everything in this reports
folder and then use this utility to generate an .html
page that will be your serve as your final handin.
Running
python report.py check\n
will check your answers in this template against the constrains listed for each question e.g. is your answer too short, too long, have you included an image when asked to.
For both functions to work it is important that you do not rename anything. The script have two dependencies that can be installed with pip install click markdown
.
The checklist is exhaustic which means that it includes everything that you could possible do on the project in relation the curricilum in this course. Therefore, we do not expect at all that you have checked of all boxes at the end of the project.
"},{"location":"reports/#week-1","title":"Week 1","text":"make_dataset.py
file such that it downloads whatever data you need andrequirements.txt
file with whatever dependencies that you are usingpep8
) while doing the projectEnter the group number you signed up on
Answer:
--- question 1 fill here ---
"},{"location":"reports/#question-2","title":"Question 2","text":"Enter the study number for each member in the group
Example:
sXXXXXX, sXXXXXX, sXXXXXX
Answer:
--- question 2 fill here ---
"},{"location":"reports/#question-3","title":"Question 3","text":"What framework did you choose to work with and did it help you complete the project?
Answer length: 100-200 words.
Example: We used the third-party framework ... in our project. We used functionality ... and functionality ... from the package to do ... and ... in our project.
Answer:
--- question 3 fill here ---
"},{"location":"reports/#coding-environment","title":"Coding environment","text":"In the following section we are interested in learning more about you local development environment.
"},{"location":"reports/#question-4","title":"Question 4","text":"Explain how you managed dependencies in your project? Explain the process a new team member would have to go through to get an exact copy of your environment.
Answer length: 100-200 words
Example: We used ... for managing our dependencies. The list of dependencies was auto-generated using ... . To get a complete copy of our development environment, one would have to run the following commands
Answer:
--- question 4 fill here ---
"},{"location":"reports/#question-5","title":"Question 5","text":"We expect that you initialized your project using the cookiecutter template. Explain the overall structure of your code. Did you fill out every folder or only a subset?
Answer length: 100-200 words
Example: From the cookiecutter template we have filled out the ... , ... and ... folder. We have removed the ... folder because we did not use any ... in our project. We have added an ... folder that contains ... for running our experiments. Answer:
--- question 5 fill here ---
"},{"location":"reports/#question-6","title":"Question 6","text":"Did you implement any rules for code quality and format? Additionally, explain with your own words why these concepts matters in larger projects.
Answer length: 50-100 words.
Answer:
--- question 6 fill here ---
"},{"location":"reports/#version-control","title":"Version control","text":"In the following section we are interested in how version control was used in your project during development to corporate and increase the quality of your code.
"},{"location":"reports/#question-7","title":"Question 7","text":"How many tests did you implement and what are they testing in your code?
Answer length: 50-100 words.
Example: In total we have implemented X tests. Primarily we are testing ... and ... as these the most critical parts of our application but also ... .
Answer:
--- question 7 fill here ---
"},{"location":"reports/#question-8","title":"Question 8","text":"What is the total code coverage (in percentage) of your code? If you code had an code coverage of 100% (or close to), would you still trust it to be error free? Explain you reasoning.
Answer length: 100-200 words.
Example: The total code coverage of code is X%, which includes all our source code. We are far from 100% coverage of our ** code and even if we were then...*
Answer:
--- question 8 fill here ---
"},{"location":"reports/#question-9","title":"Question 9","text":"Did you workflow include using branches and pull requests? If yes, explain how. If not, explain how branches and pull request can help improve version control.
Answer length: 100-200 words.
Example: We made use of both branches and PRs in our project. In our group, each member had an branch that they worked on in addition to the main branch. To merge code we ...
Answer:
--- question 9 fill here ---
"},{"location":"reports/#question-10","title":"Question 10","text":"Did you use DVC for managing data in your project? If yes, then how did it improve your project to have version control of your data. If no, explain a case where it would be beneficial to have version control of your data.
Answer length: 100-200 words.
Example: We did make use of DVC in the following way: ... . In the end it helped us in ... for controlling ... part of our pipeline
Answer:
--- question 10 fill here ---
"},{"location":"reports/#question-11","title":"Question 11","text":"Discuss you continues integration setup. What kind of CI are you running (unittesting, linting, etc.)? Do you test multiple operating systems, python version etc. Do you make use of caching? Feel free to insert a link to one of your github actions workflow.
Answer length: 200-300 words.
Example: We have organized our CI into 3 separate files: one for doing ..., one for running ... testing and one for running ... . In particular for our ..., we used ... .An example of a triggered workflow can be seen here:
Answer:
--- question 11 fill here ---
"},{"location":"reports/#running-code-and-tracking-experiments","title":"Running code and tracking experiments","text":"In the following section we are interested in learning more about the experimental setup for running your code and especially the reproducibility of your experiments.
"},{"location":"reports/#question-12","title":"Question 12","text":"How did you configure experiments? Did you make use of config files? Explain with coding examples of how you would run a experiment.
Answer length: 50-100 words.
Example: We used a simple argparser, that worked in the following way: python my_script.py --lr 1e-3 --batch_size 25
Answer:
--- question 12 fill here ---
"},{"location":"reports/#question-13","title":"Question 13","text":"Reproducibility of experiments are important. Related to the last question, how did you secure that no information is lost when running experiments and that your experiments are reproducible?
Answer length: 100-200 words.
Example: We made use of config files. Whenever an experiment is run the following happens: ... . To reproduce an experiment one would have to do ...
Answer:
--- question 13 fill here ---
"},{"location":"reports/#question-14","title":"Question 14","text":"Upload 1 to 3 screenshots that show the experiments that you have done in W&B (or another experiment tracking service of your choice). This may include loss graphs, logged images, hyperparameter sweeps etc. You can take inspiration from this figure. Explain what metrics you are tracking and why they are important.
Answer length: 200-300 words + 1 to 3 screenshots.
Example: As seen in the first image when have tracked ... and ... which both inform us about ... in our experiments. As seen in the second image we are also tracking ... and ...
Answer:
--- question 14 fill here ---
"},{"location":"reports/#question-15","title":"Question 15","text":"Docker is an important tool for creating containerized applications. Explain how you used docker in your experiments? Include how you would run your docker images and include a link to one of your docker files.
Answer length: 100-200 words.
Example: For our project we developed several images: one for training, inference and deployment. For example to run the training docker image: docker run trainer:latest lr=1e-3 batch_size=64
. Link to docker file:
Answer:
--- question 15 fill here ---
"},{"location":"reports/#question-16","title":"Question 16","text":"When running into bugs while trying to run your experiments, how did you perform debugging? Additionally, did you try to profile your code or do you think it is already perfect?
Answer length: 100-200 words.
Example: Debugging method was dependent on group member. Some just used ... and others used ... . We did a single profiling run of our main code at some point that showed ...
Answer:
--- question 16 fill here ---
"},{"location":"reports/#working-in-the-cloud","title":"Working in the cloud","text":"In the following section we would like to know more about your experience when developing in the cloud.
"},{"location":"reports/#question-17","title":"Question 17","text":"List all the GCP services that you made use of in your project and shortly explain what each service does?
Answer length: 50-200 words.
Example: We used the following two services: Engine and Bucket. Engine is used for... and Bucket is used for...
Answer:
--- question 17 fill here ---
"},{"location":"reports/#question-18","title":"Question 18","text":"The backbone of GCP is the Compute engine. Explained how you made use of this service and what type of VMs you used?
Answer length: 100-200 words.
Example: We used the compute engine to run our ... . We used instances with the following hardware: ... and we started the using a custom container: ...
Answer:
--- question 18 fill here ---
"},{"location":"reports/#question-19","title":"Question 19","text":"Insert 1-2 images of your GCP bucket, such that we can see what data you have stored in it. You can take inspiration from this figure.
Answer:
--- question 19 fill here ---
"},{"location":"reports/#question-20","title":"Question 20","text":"Upload one image of your GCP container registry, such that we can see the different images that you have stored. You can take inspiration from this figure.
Answer:
--- question 20 fill here ---
"},{"location":"reports/#question-21","title":"Question 21","text":"Upload one image of your GCP cloud build history, so we can see the history of the images that have been build in your project. You can take inspiration from this figure.
Answer:
--- question 21 fill here ---
"},{"location":"reports/#question-22","title":"Question 22","text":"Did you manage to deploy your model, either in locally or cloud? If not, describe why. If yes, describe how and preferably how you invoke your deployed service?
Answer length: 100-200 words.
Example: For deployment we wrapped our model into an application using ... . We first tried locally serving the model, which worked. Afterwards we deployed it in the cloud, using ... . To invoke the service a user would call curl -X POST -F \"file=@file.json\"<weburl>
Answer:
--- question 22 fill here ---
"},{"location":"reports/#question-23","title":"Question 23","text":"Did you manage to implement monitoring of your deployed model? If yes, explain how it works. If not, explain how monitoring would help the longevity of your application.
Answer length: 100-200 words.
Example: We did not manage to implement monitoring. We would like to have monitoring implemented such that over time we could measure ... and ... that would inform us about this ... behaviour of our application.
Answer:
--- question 23 fill here ---
"},{"location":"reports/#question-24","title":"Question 24","text":"How many credits did you end up using during the project and what service was most expensive?
Answer length: 25-100 words.
Example: Group member 1 used ..., Group member 2 used ..., in total ... credits were spent during development. The service costing the most was ... due to ...
Answer:
--- question 24 fill here ---
"},{"location":"reports/#overall-discussion-of-project","title":"Overall discussion of project","text":"In the following section we would like you to think about the general structure of your project.
"},{"location":"reports/#question-25","title":"Question 25","text":"Include a figure that describes the overall architecture of your system and what services that you make use of. You can take inspiration from this figure. Additionally in your own words, explain the overall steps in figure.
Answer length: 200-400 words
Example:
The starting point of the diagram is our local setup, where we integrated ... and ... and ... into our code. Whenever we commit code and push to GitHub, it automatically triggers ... and ... . From there the diagram shows ...
Answer:
--- question 25 fill here ---
"},{"location":"reports/#question-26","title":"Question 26","text":"Discuss the overall struggles of the project. Where did you spend most time and what did you do to overcome these challenges?
Answer length: 200-400 words.
Example: The biggest challenges in the project were using the ... tool to do ... . The reason for this was ...
Answer:
--- question 26 fill here ---
"},{"location":"reports/#question-27","title":"Question 27","text":"State the individual contributions of each team member. This is required information from DTU, because we need to make sure all members contributed actively to the project
Answer length: 50-200 words.
Example: Student sXXXXXX was in charge of setting up the initial cookiecutter project and developing the docker containers for training our applications. Student sXXXXXX was in charge of training our models in the cloud and deploying them afterwards. All members contributed to code by...
Answer:
--- question 27 fill here ---
"},{"location":"s10_extra/","title":"Extra learning modules","text":"All modules listed here are not part of the core course, but expands on some of the other topics. Some of them may still be under construction and may in the future be moved into other sessions.
"},{"location":"s10_extra/cli/","title":"M30 - Command Line Interfaces","text":""},{"location":"s10_extra/cli/#command-line-interfaces","title":"Command line interfaces","text":"
If you have worked with python for some time you are probably familiar with the argparse
package, which allows you to directly pass in additional arguments to your script in the terminal
python my_script.py --arg1 val1 --arg2 val2\n
argparse
is a very simple way of constructing what is called a command line interface (CLI). A CLI allows you to interact with your application directly in the terminal instead of having to change things in your code. It is essentially a text-based user interface (UI), in contrast to the graphical user interfaces (GUIs) that we know from all our desktop applications.
However, one limitation of argparse
is that it does not easily support defining a CLI with subcommands. If we take git
as an example, git
is the main command but it has multiple subcommands: push
, pull
, commit
etc., which can all take their own arguments. This kind of nested CLI with subcommands is somewhat possible to build using only argparse
, however it requires a bit of hacking.
You could of course ask why we would want to define such a CLI at all. The main argument is to give users of our code a single entry point for interacting with our application instead of having multiple scripts. As long as all subcommands are properly documented, the interface should be simple to interact with (again think of git
where each subcommand can be given the -h
arg to get specific help).
Instead of using argparse
we are here going to look at the click package. click
extends the functionalities of argparse
to allow for easy definition of subcommands and many other things, which we are not going to touch upon in this module. For completeness we should also mention that click
is not the only package for doing this; among other excellent frameworks for easily creating command line interfaces we can mention Typer.
Exercise files
Install click
pip install click\n
Create a new python file greetings.py
and add the following code:
import click\n\n@click.command()\n@click.option('--count', default=1, help='Number of greetings.')\n@click.option('--name', prompt='Your name', help='The person to greet.')\ndef hello(count, name):\n \"\"\"Simple program that greets NAME for a total of COUNT times.\"\"\"\n for x in range(count):\n click.echo(f\"Hello {name}!\")\n\nif __name__ == '__main__':\n hello()\n
try running the program in the following ways
python greetings.py\npython greetings.py --count=3\npython greetings.py --help\n
Make sure you understand what the click.command()
decorator and click.option
decorator do. You can find the full API docs here.
As stated above, the power of using a tool like click is due to its ability to define subcommands. In click
this is done through the click.group()
decorator. To the code example from above, add another command:
@click.command()\n@click.option('--count', default=1, help='Number of greetings.')\n@click.option('--name', prompt='Your name', help='The person to greet.')\ndef howdy(count, name):\n for x in range(count):\n click.echo(f\"Howdy {name}!\")\n
and by using the click.group()
decorator make these commands into subcommands such that you would be able to call the script in the following way
python greetings.py hello\npython greetings.py howdy\n
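If you get stuck, here is a minimal sketch of one possible wiring (not the only solution): the two commands simply become members of a group created with the click.group() decorator.
import click\n\n@click.group()\ndef cli():\n    \"\"\"Entry point that collects all subcommands.\"\"\"\n\n@cli.command()\n@click.option('--count', default=1, help='Number of greetings.')\n@click.option('--name', prompt='Your name', help='The person to greet.')\ndef hello(count, name):\n    for _ in range(count):\n        click.echo(f\"Hello {name}!\")\n\n@cli.command()\n@click.option('--count', default=1, help='Number of greetings.')\n@click.option('--name', prompt='Your name', help='The person to greet.')\ndef howdy(count, name):\n    for _ in range(count):\n        click.echo(f\"Howdy {name}!\")\n\nif __name__ == '__main__':\n    cli()\n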
As a final exercise we provide you with a script that is ready to run as it is, but your job will be to turn it into a script with multiple subcommands, with multiple arguments for each subcommand.
Start by taking a look at the provided code. It is a simple script that runs the K-nearest neighbour classification algorithm on the iris dataset and produces a plot of the decision boundary.
Create a script that has the following subcommands with input arguments
train
: Load data, train model and save. Should take a single argument -o
that specifies the filename the trained model should be saved to.infer
: Load trained model and runs prediction on input data. Should take two arguments: -i
that specifies which trained model to load and -d
to specify a user defined datapoint to run inference on.plot
: Load trained model and constructs the decision boundary plot from the code. Should take two arguments: -i
that specifies a trained model to load and -o
the file to write the generated plot tooptim
: Load data, run hyperparameter optimization and print the optimal parameters. Should take at least a single argument that in some way adjusts the hyperparameter optimization (free to choose how).In the end we would like the script to be callable in the following ways
python main.py train -o 'model.ckpt'\npython main.py infer -i 'model.ckpt' -d [[0,1]]\npython main.py plot -i 'model.ckpt' -o 'generated_plot.png'\npython main.py optim\n
Danger
Module is still under development
\"Machine learning engineering is 10% machine learning and 90% engineering.\" - Chip Huyen
We highly recommend that you read the book Designing Machine Learning Systems: An Iterative Process for Production-Ready Applications by Chip Huyen, which gives a fantastic overview of the thought processes that go into designing modern machine learning systems.
"},{"location":"s10_extra/design/#the-stack","title":"The stack","text":"Have you ever encountered the concept of full stack developer. A full stack developer is an developer who can both develop client and server software or in more general terms, it is a developer who can take care of the complete developer pipeline.
Below is an image of the massive number of tools that exist under the MLOps umbrella.
"},{"location":"s10_extra/design/#visualizing-the-design","title":"Visualizing the design","text":""},{"location":"s10_extra/documentation/","title":"M31 - Documentation","text":""},{"location":"s10_extra/documentation/#documentation","title":"Documentation","text":"In today's rapidly evolving software development landscape, effective documentation is a crucial component of any project. The ability to create clear, concise, and user-friendly technical documentation can make a significant difference in the success of your codebase. We all probably encountered code that we wanted to use, only for us to abandon using it because it was missing documentation such that we could get started with it.
Technical documentation or code documentation can be many things:
and many more. We are in this module going to focus on setting up a very basic documentation system that will automatically help you document the API of your code. For this reason we recommend that before continuing with this module you have completed module M7 on good coding practices or have similar experience with writing docstrings for python functions and classes.
There are different systems for writing documentation. In fact there is a lot to choose from:
It is important to note that all of these are static site generators. The word static here refers to the fact that when the content is generated and served on a website, the underlying HTML code will not change. It may contain HTML elements that are dynamic (like video), but the site itself does not change (1).
We are in this module going to look at Mkdocs, which (in my opinion) is one of the easiest systems to get started with because all documentation is written in markdown and the build system is written in Python. As an alternative, you can consider doing the exercises in Sphinx, which is probably the most used documentation system for Python code. Sphinx offers more customization than Mkdocs and is therefore generally preferred for larger projects with complex documentation, but for smaller projects Mkdocs is easier to get started with and is sufficient.
Mkdocs by default does not include many features and for that reason we are going to dive directly into using the material for mkdocs theme that provides a lot of nice customization to create professional static sites. In fact, this whole course is written in mkdocs using the material theme.
"},{"location":"s10_extra/documentation/#mkdocs","title":"Mkdocs","text":"The core file when using mkdocs is the mkdocs.yml
file, which is the configuration file for the project:
site_name: Documentation of my project\nsite_author: Jane Doe\ndocs_dir: source # (1)!\n\ntheme:\n language: en\n name: material # (2)!\n features: # (3)!\n - content.code.copy\n - content.code.annotate\n\nplugins: # (4)!\n - search\n - mkdocstrings\n\nnav: # (5)!\n - Home: index.md\n
This indicates the source directory of our documentation. If the layout of your documentation is a bit different than what described above, you may need to change this.
The overall theme of your documentation. We recommend the material
theme but there are many more to choose from and you can also create your own.
The features
section is where features that are supported by your given theme can be enabled. In this example we have enabled content.code.copy
feature, which adds a small copy button to all code blocks, and the content.code.annotate
feature which allows you to add annotations like this box to code blocks.
Plugins add new functionality to your documentation. In this case we have added two plugins that add functionality for searching through our documentation and automatically adding documentation from docstrings. Remember that some plugins require you to install additional Python packages, so remember to add them to your requirements.txt
file.
The nav
section is where you define the navigation structure of your documentation. When you add new .md
files to the source
folder you then need to add them to the nav
section.
And that is more or less what you need to get started. In general, if you need help with configuration of your documentation in mkdocs I recommend looking at this page and this page.
"},{"location":"s10_extra/documentation/#exercises","title":"Exercises","text":"In this set of exercises we assume that you have completed module M6 on code structure and therefore have a repository that at least contains the following:
\u251c\u2500\u2500 pyproject.toml <- Project configuration file with package metadata\n\u2502\n\u251c\u2500\u2500 docs <- Documentation folder\n\u2502 \u2502\n\u2502 \u251c\u2500\u2500 index.md <- Homepage for your documentation\n\u2502 \u2502\n\u2502 \u251c\u2500\u2500 mkdocs.yml <- Configuration file for mkdocs\n\u2502 \u2502\n\u2502 \u2514\u2500\u2500 source/ <- Source directory for documentation files\n\u2502\n\u2514\u2500\u2500 src <- Source code for use in this project.\n\u2502 \u2502\n\u2502 \u251c\u2500\u2500 __init__.py <- Makes src a Python module\n\u2502 \u2502\n\u2502 \u251c\u2500\u2500 models <- model implementations, training script\n\u2502 \u2502 \u251c\u2500\u2500 __init__.py\n\u2502 \u2502 \u251c\u2500\u2500 model.py\n\u2502 \u2502 \u251c\u2500\u2500 train_model.py\n...\n
It is not important exactly what is in the src
folder for the exercises, but we are going to refer to the above structure in the exercises, so adjust accordingly if you deviate from this. Additionally, we are going to assume that your project code is installed in your environment such that it can be imported as normal python code.
We are going to need two python packages to get started: mkdocs and material for mkdocs. Install with
pip install \"mkdocs-material >= 4.8.0\" # (1)!\n
mkdocs
is a dependency of mkdocs-material, so
we only need to install the latter.Run in your terminal (from the docs
folder):
mkdocs serve # (1)!\n
mkdocs serve
will automatically rebuild the whole site whenever you save a file inside the docs
folder. This is not a problem if you have a fairly small site with not that many pages (or elements), but can take a long time for large sites. Consider running with the --dirty
option for only re-building the site for files that have been changed.which should render the index.md
file as the homepage. You can leave the documentation server running during the remaining exercises.
We are now ready to document the API of our code:
Make sure you at least have one function and class inside your src
module. If you do not, you can for simplicity copy the following module to the src/models/model.py
file
import torch\n\nclass MyNeuralNet(torch.nn.Module):\n    \"\"\"Basic neural network class.\n\n    Args:\n        in_features: number of input features\n        out_features: number of output features\n\n    \"\"\"\n    def __init__(self, in_features: int, out_features: int) -> None:\n        super().__init__()  # required so that the submodules below are registered correctly\n        self.l1 = torch.nn.Linear(in_features, 500)\n        self.l2 = torch.nn.Linear(500, out_features)\n        self.r = torch.nn.ReLU()\n\n    def forward(self, x: torch.Tensor) -> torch.Tensor:\n        \"\"\"Forward pass of the model.\n\n        Args:\n            x: input tensor expected to be of shape [N,in_features]\n\n        Returns:\n            Output tensor with shape [N,out_features]\n\n        \"\"\"\n        return self.l2(self.r(self.l1(x)))\n
and add the following function to the src/predict_model.py
file:
import torch\n\ndef predict(\n    model: torch.nn.Module,\n    dataloader: torch.utils.data.DataLoader,\n) -> list[torch.Tensor]:\n    \"\"\"Run prediction for a given model and dataloader.\n\n    Args:\n        model: model to use for prediction\n        dataloader: dataloader with batches\n\n    Returns:\n        A list of tensors, one per batch, each of shape [B, d] where d is the output dimension of the model\n\n    \"\"\"\n    return [model(batch) for batch in dataloader]\n
Add a markdown file to the docs/source
folder called my_api.md
and add that file to the nav:
section in the mkdocs.yaml
file.
To that file add the following code:
# My API\n\n::: src.models.model.MyNeuralNet\n\n::: src.predict_model.predict\n
The :::
indicator tells mkdocs that it should look for the corresponding function/module and then render it on the given page. Thus, if you have a function/module located in another location change the paths accordingly.
Make sure that the documentation correctly includes your function and module on the given page.
(Optional) Include more functions/modules in your documentation.
(Optional) Look through the documentation for mkdocstrings and try to improve the layout a bit. Especially, the headings, docstrings and signatures could be of interest to adjust.
Finally, try to build a final version of your documentation
mkdocs build\n
this should result in a site
folder that contains the actual HTML code for documentation.
To publish your documentation you need a place to host the built documentation, e.g. the content of the site
folder you built in the last exercise. There are many places to host your documentation, but if you only need a static site and are already hosting your code through Github, then a good option is Github Pages. Github Pages is free to use for your public projects.
Before getting started with this set of exercises you should have completed module M16 on github actions so you already know about workflow files.
"},{"location":"s10_extra/documentation/#exercises_1","title":"Exercises","text":"Start by adding a new file called deploy_docs.yaml
to the .github/workflows
folder. Add the following code to that file and save it.
name: Deploy docs\n\non:\n  push:\n    branches:\n      - main\n\npermissions:\n  contents: write # (1)\n\njobs:\n  deploy:\n    runs-on: ubuntu-latest\n    steps:\n      - uses: actions/checkout@v3\n        with:\n          fetch-depth: 0\n      - uses: actions/setup-python@v4\n        with:\n          python-version: \"3.10\" # quoted so YAML does not read it as the number 3.1\n      - uses: actions/cache@v2\n        with:\n          key: ${{ github.ref }}\n          path: .cache\n      - run: pip install -r requirements.txt\n      - run: mkdocs gh-deploy --force\n
write
permissions to this action because it is not only reading your code but will actually also push code.Before continuing, make sure you understand what the different steps of the workflow do, and we especially recommend looking at the documentation of the mkdocs gh-deploy
command.
Commit and push the file. Check that the action is executed and, if it succeeds, that your built project is pushed to a branch called gh-pages
. If the action does not succeed, then figure out what is wrong and fix it!
After confirming that our action is working, you need to configure Github to actually publish the content being built by Github Actions. Do the following:
Source
setting choose the Deploy from a branch
Branch
setting choose the gh-pages
branch and /(root)
folder and save
This should then start deploying your site to https://<your-username>.github.io/<your-reponame>/
. If it does not do this you may need to recommit and trigger the github actions build again.
Make sure your documentation is published and looks as it should.
This ends the module on technical documentation. We cannot stress enough how important it is to write proper documentation for larger projects that need to be maintained over a longer time. It is an iterative process, but it is often best to do it while writing the code.
"},{"location":"s10_extra/frontend/","title":"Frontend","text":"Danger
Module is still under development
"},{"location":"s10_extra/frontend/#streamlit","title":"Streamlit","text":"steamlit
streamlit
pip install streamlit\n
and run streamlit hello
afterwards to check that everything works as expected.
As discussed in the intro session on the cloud, cloud providers offers near infinite compute resources. However, using these resources comes at a hefty price often and it is therefore important to be aware of another resource many have access to: High Performance Clusters or HPC. HPCs exist all over the world, and many time you already have access to one or can easily get access to one. If you are an university student you most likely have a local HPC that you can access through your institution. Else, there exist public HPC resources that everybody (with a project) can apply for. As an example in the EU we have EuroHPC initiative that currently has 8 different supercomputers with a centralized location for applying for resources that are both open for research projects and start-ups.
Depending on your application, you may have different needs and it is therefore important to be aware also of the different tiers of HPC. In Europe, HPC are often categorized such that Tier-0 are European Centers with petaflop or hexascale machines,\u00a0Tier 1 are National centers of supercomputers, and Tier 2 are Regional centers. The lower the Tier, the larger applications it is possible to run.
Image credit"},{"location":"s10_extra/high_performance_clusters/#cluster-architectures","title":"Cluster architectures","text":"In very general terms, cluster can come as two different kind of systems: supercomputers and LSF (Load Sharing Facility). A supercomputer (as shown below) is organized into different modules, that are separated by network link. When you login to a supercomputer you will meet the front end which contains all the software needed to run computations. When you submit a job it will get sent to the backend modules which in most cases includes: general compute modules (CPU), acceleration modules (GPU), a memory module (RAM) and finally a storage module (HDD). Depending on your application you may need one module more than another. For example in deep learning the acceleration module is important but in physics simulation the general compute module / storage model is probably more important.
Overview of the Meluxina supercomputer that's part of EuroHPC. Image creditAlternatively, LSF are a network of computers where each computer has its own CPU, GPU, RAM etc. and the individual computes (or nodes) are then connected by network. The important different between a supercomputer and as LSF systems is how the resources are organized. When comparing supercomputers to LSF system it is generally the case that it is better to run on a LSF system if you are only requesting resources that can be handled by a single node, however it is better to run on a supercomputer if you have a resource intensive application that requires many devices to communicate with each others.
Regardless of cluster architectures, on the software side of HPC, the most important part is what's called the HPC scheduler. Without a HPC scheduler an HPC cluster would just be a bunch of servers with different jobs interfering with each other. The problem is when you have a large collection of resources and a large collection of users, you cannot rely on the users just running their applications without interfering with each other. A HPC scheduler is in charge of managing that whenever an user request to run an application, they get put in a queue and whenever the resources their application ask for are available the application gets run.
The biggest bach control systems for doing scheduling on HPC are:
We are going to take a look at PBS works as that is what is installed on our local university cluster.
"},{"location":"s10_extra/high_performance_clusters/#exercises","title":"\u2754 Exercises","text":"Exercise files
The following exercises are focused on local students at DTU that want to use our local HPC resources. That said, the steps in the exercise are fairly general to other types of cluster. For the purpose of this exercise we are going to see how we can run this image classifier script , but feel free to work with whatever application you want to.
Start by accessing the cluster. This can either be through ssh
in a terminal or if you want a graphical interface thinlinc can be installed. In general we recommend following the steps here for DTU students as the setup depends on if you are on campus or not.
When you have access to the cluster we are going to start with the setup phase. In the setup phase we are going to setup the environment necessary for our computations. If you have accessed the cluster through graphical interface start by opening a terminal.
Lets start by setting up conda for controlling our dependencies. If you have not already worked with conda
, please checkout module M2 on package managers and virtual environments. In general you should be able to setup (mini)conda through these two commands:
wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh\nsh Miniconda3-latest-Linux-x86_64.sh\n
Close the terminal and open a new for the installation to complete. Type conda
in the terminal to check that everything is fine. Go ahead and create a new environment that we can install dependencies in
conda create -n \"hpc_env\" python=3.10 --no-default-packages\n
and activate it.
Copy over any files you need. For the image classifier script you need the requirements file and the actual application.
Next, install all the requirements you need. If you want to run the image classifier script you can run this command in the terminal
pip install -r image_classifier_requirements.txt\n
using this requirements file.
That's all the setup needed. You would need to go through the creating of environment and installation of requirements whenever you start a new project (no need for reinstalling conda). For the next step we need to look at how to submit jobs on the cluster. We are now ready to submit the our first job to the cluster:
Start by checking the statistics for the different clusters. Try to use both the qstat
command which should give an overview of the different cluster, number of running jobs and number of pending jobs. For many system you can also try the much more user friendly command classstat
command.
Figure out which queue you want to use. For the sake of the exercises it needs to be one with GPU support. For DTU students, any queue that starts with gpu
are GPU accelerated.
Now we are going to develop a bash script for submitting our job. We have provided an example of such scripts. Take a careful look and go each line and make sure you understand it. Afterwards, change it to your needs (queue and student email).
Try to submit the script:
bsub < jobscript.sh\n
You can check the status of your script by running the bstat
command. Hopefully, the job should go through really quickly. Take a look at the output file, it should be called something like gpu_*.out
. Also take a look at the gpu_*.err
file. Does both files look as they should?
Lets now try to run our application on the cluster. To do that we need to take care of two things:
First we need to load the correct version of CUDA. A cluster system often contains multiple versions of specific software to suit the needs of all their users, and it is the users that are in charge of loading the correct software during job submission. The only extra software that needs to be loaded for most Pytorch applications are a CUDA module. You can check which modules are available on the cluster with
module avail\n
Afterwards, add the correct CUDA version you need to the jobscript.sh
file. If you are trying to run the provided image classifier script then the correct version is CUDA/11.7
(can be seen in the requirements file).
# add to the bottom of the file\nmodule load cuda/11.7\n
We are now ready to add in our application. The only thing we need to take care of is telling the system to run it using the python
version that is connected to our hpc_env
we created in the beginning. Try typing:
which python\n
which should give you the full path. Then add to the bottom of the jobscript
file:
~/miniconda3/envs/hpc_env/bin/python \\\n image_classifier.py \\\n --trainer.accelerator 'gpu' --trainer.devices 1 --trainer.max_epochs 5\n
which will run the image classifier script (change it if you are running something else).
Finally submit the job:
bsub < jobscript.sh\n
and check when it is done that it has produced what you expected.
(Optional) If you application supports multi GPUs also try that out. You would first need to change the jobscript to request multiple GPUs and additionally you would need to tell your application to run on multiple GPUs. For the image classifier script it can be done by changing the --trainer.devices
flag to 2
(or higher).
This ends the module on using HPC systems.
"},{"location":"s10_extra/hyperparameters/","title":"M32 - Hyperparameter optimization","text":""},{"location":"s10_extra/hyperparameters/#hyperparameter-optimization","title":"Hyperparameter optimization","text":"Hyperparameter optimization is not a new idea within machine learning but have somewhat seen a renaissance with the uprise of deep learning. This can mainly be contributed to the following:
However the problem with doing hyperparameter optimization of a deep learning models is that it can take over a week to train a single model. In most cases we therefore cannot do a full grid search of all hyperparameter combinations to get the best model. Instead we have to do some tricks that will help us speed up our searching. In these exercises we are going to be integrating optuna into our different models, that will provide the tools for speeding up our search.
It should be noted that a lot of deep learning models does not optimize every hyperparameter that is included in the model but instead relies on heuristic guidelines (\"rule of thumb\") based on what seems to be working in general e.g. a learning rate of 0.01 seems to work great with the Adam optimizer. That said, these rules probably only apply to 80% of deep learning model, whereas for the last 20% the recommendations may be suboptimal Here is a great site that has collected an extensive list of these recommendations, taken from the excellent deep learning book by Ian Goodfellow, Yoshua Bengio and Aaron Courville.
In practice, I recommend trying to identify (through experimentation) which hyperparameters that are important for the performance of your model and then spend your computational budget trying to optimize them while setting the rest to a \"recommended value\".
"},{"location":"s10_extra/hyperparameters/#exercises","title":"\u2754 Exercises","text":"Exercise files
Start by installing optuna: pip install optuna
Initially we will look at the cross_validate.py
file. It implements simple K-fold cross validation of a random forest sklearn digits dataset (subset of MNIST). Look over the script and try to run it.
We will now try to write the same code in optune. Please note that the script have a variable OPTUNA=False
that you can use to change what part of the code should run. The three main concepts of optuna is
A trial: a single experiment
A study: a collection of trials
The objective: function to determine how \"good\" a trial is
Lets start by writing the objective function, which we have already started in the script. For now you do not need to care about the trial
argument, just assume that it contains the hyperparameters needed to define your random forest model. The output of the objective function should be a single number that we want to optimize. (HINT: did you remember to do K-fold crossvalidation inside your objective function?)
Next lets focus on the trial. Inside the objective
function the trial should be used to suggest what parameters to use next. Take a look at the documentation for trial or take a look at the code examples and figure out how to define the hyperparameter of the model.
Finally lets launch a study. It can be as simple as
study = optuna.create_study()\nstudy.optimize(objective, n_trials=100)\n
but lets play around a bit with it:
By default the .optimize
method will minimize the objective (by definition the optimum of an objective function is at its minimum). Is the score your objective function is returning something that should be minimized? If not, a simple solution is to put a -
in front of the metric. However, look through the documentation on how to change the direction of the optimization.
Optuna will by default do Bayesian optimization when sampling the hyperparameters (using a evolutionary algorithm for suggesting new trials). However, since this example is quite simple, we can actually perform a full grid search. How would you do this in Optuna?
Compare the performance of a single optuna run using Bayesian optimization with n_trials=10
with a exhaustive grid search that have search through all hyperparameters. What is the performance/time trade-off for these two solutions?
In addition to doing baysian optimization, the other great part about Optuna is that it have native support for Pruning unpromising trials. Pruning refers to the user stopping trials for hyperparameter combinations that does not seem to lead anywhere. You may have learning rate that is so high that training is diverging or a neural network with too many parameters so it is just overfitting to the training data. This however begs the question: what constitutes an unpromising trial? This is up to you to define based on prior experimentation.
Start by looking at the fashion_trainer.py
script. Its a simple classification network for classifying images in the FashionMNIST dataset. Run the script with the default hyperparameters to get a feeling of how the training should be progress. Note down the performance on the test set.
Start by defining a validation set and a validation dataloader that we can use for hyperparameter optimization (HINT: use 5-10% of you training data).
Now, adjust the script to use Optuna. The 5 hyperparameters listed in the table above should at least be included in the hyperparameter search. For some we have already defined the search space but for the remaining you need to come up with a good range of values to investigate. We done integrating optuna, run a small study (n_tirals=3
) to check that the code is working.
nn.ReLU
, nn.Tanh
, nn.RReLU
, nn.LeakyReLU
, nn.ELU
} If implemented correctly the number of hyperparameter combinations should be at least 1000, meaning that we not only need baysian optimization but probably also need pruning to succeed. Checkout the page for built-in pruners in Optuna. Implement pruning in the script. I recommend using either the MedianPruner
or the ProcentilePruner
.
Re-run the study using pruning with a large number of trials (n_trials>50
)
Take a look at this visualization page for ideas on how to visualize the study you just did. Make at least two visualization of the study and make sure that you understand them.
Pruning is great for better spending your computational budged, however it comes with a trade-off. What is it and what hyperparameter should one be especially careful about when using pruning?
Finally, what parameter combination achieved the best performance? What is the test set performance for this set of parameters. Did you improve over the initial set of hyperparameters?
The exercises until now have focused on doing the hyperparameter searching sequentially, meaning that we test one set of parameters at the time. It is a fine approach because you can easily let it run for a week without any interaction. However, assuming that you have the computational resources to run in parallel, how do you do that?
To run hyperparameter search in parallel we need a common database that all experiments can read and write to. We are going to use the recommended mysql
. You do not have to understand what SQL is to complete this exercise, but it is basically a language (like python) for managing databases. Install mysql.
Next we are going to initialize a database that we can read and write to. For this exercises we are going to focus on a locally stored database but it could of course also be located in the cloud.
mysql -u root -e \"CREATE DATABASE IF NOT EXISTS example\"\n
you can also do this directly in python when calling the create_study
command by also setting the storage
and load_if_exists=True
flags.
Now we are going to create a Optuna study in our database
optuna create-study --study-name \"distributed-example\" --storage \"mysql://root@localhost/example\"\n
Change how you initialize the study to read and write to the database. Therefore, instead of doing
study = optuna.create_study()\n
then do
study = optuna.load_study(\n study_name=\"distributed-example\", storage=\"mysql://root@localhost/example\"\n)\n
where the study_name
and storage
should match how the study was created.
For running in parallel, you can either open up a extra terminal and simple launch your script once per open terminal or you can use the provided parallel_lancher.py
that will launch multiple executions of your script. It should be used as:
python parallel_lancher.py myscript.py --num_parallel 2\n
Finally, make sure that you can access the results
That's all on how to do hyperparameter optimization in a scalable way. If you feel like it you can try to apply these techniques on the ongoing corrupted MNIST example, where you are free to choose what hyperparameters that you want to use.
"},{"location":"s10_extra/kubernetes/","title":"Kubernetes","text":"Danger
Module is still under development
"},{"location":"s10_extra/kubernetes/#kubernetes","title":"Kubernetes","text":"Kubernetes, also known as K8s, is an open-source platform designed to automate the deployment, scaling, and operation of application containers across clusters of hosts. It provides the framework to run distributed systems resiliently, handling scaling and failover for your applications, providing deployment patterns, and more.
"},{"location":"s10_extra/kubernetes/#what-is-kubernetes","title":"What is Kubernetes?","text":""},{"location":"s10_extra/kubernetes/#brief-history","title":"Brief History","text":"Kubernetes was originally developed by Google and is now maintained by the Cloud Native Computing Foundation.
"},{"location":"s10_extra/kubernetes/#core-functions","title":"Core Functions","text":"Kubernetes makes it easier to deploy and manage containerized applications at scale.
"},{"location":"s10_extra/kubernetes/#key-concepts","title":"Key Concepts","text":"Kubernetes follows a client-server architecture. At a high level, it consists of a Control Plane (master) and Nodes (workers).
Overview of Kubernetes Architecture. Image Credit: Kubernetes Official Documentation"},{"location":"s10_extra/kubernetes/#control-plane-components","title":"Control Plane Components","text":"Minikube is a tool that allows you to run Kubernetes locally. It runs a single-node Kubernetes cluster inside a VM on your laptop for users looking to try out Kubernetes or develop with it day-to-day.
"},{"location":"s10_extra/kubernetes/#installing-minikube","title":"Installing Minikube","text":"minikube start
.minikube
in a terminal.kubectl
in a terminal.Yatai is a model serving platform, making it easier to deploy machine learning models in Kubernetes environments.
"},{"location":"s10_extra/kubernetes/#what-is-yatai","title":"What is Yatai?","text":"Yatai simplifies the deployment, management, and scaling of machine learning models in Kubernetes.
"},{"location":"s10_extra/kubernetes/#getting-started-with-yatai","title":"Getting Started with Yatai","text":"Danger
Module is still under development
"},{"location":"s10_extra/onnx/#model-packaging","title":"Model packaging","text":"Whenever we want to serve an machine learning model, what we are actually interested in is doing predictions e.g. given a new datapoint we pass it through our model (forward pass) and the returned value is the predicted value of that datapoint. At a high-level, model predictions depends on three things:
We have already in module M9 on Docker touch on how to take care of all these things. Containers makes it easy to link a codebase, model weights and code dependencies into a single object. We in general can refer to this as model packaging, because as the name suggest, we are packaging our model into a format that is independent of the actual environment that we are trying to run the model in.
However, containers is not the only way to do model packaging. If we put some light restrictions on the device we want run our model predictions on, we can achieve the same result using ONNX. The Open Neural Network Exchange (ONNX) is a standardized format for creating and sharing machine learning models. ONNX provides an open source format for machine learning models, both deep learning and traditional ML. It defines an extensible computation graph model, as well as definitions of built-in operators and standard data types.
Image creditAs the above image indicates, the idea behind ONNX is that a model trained with a specific framework on a specific device, lets say Pytorch on your local computer, can be exported and run with an entirely different framework and hardware easily. For example, not all frameworks are created equally. For example Pytorch is in general considered an developer friendly framework, however it has historically been slow to run inference with compared to a framework such as Caffe2. ONNX allow you to mix-and-match frameworks based on different usecases, and essentially increases the longivity of your model.
"},{"location":"s10_extra/onnx/#exercises","title":"\u2754 Exercises","text":"Start by installing ONNX:
pip install onnx\npip install onnxruntime\n
the first package includes the basic building blocks for implementing generalized ONNX models and the second package is for running ONNX optimal on different hardware.
As an test that your installation is working, try executing the following python code
import onnxruntime\nonnxruntime.get_all_providers()\n
these providers are translation layers that are implemented ONNX, such that the same ONNX model can run on completely different hardware. Can you identify at least two of the providers that are necessary for running standard Pytorch code on CPU and GPU? Can you identify others
One big advantage of having a standardized format, is that we can easily visualize the computational graph of our model because it consist only of core ONNX operations. We are here going to use the open-source tool netron for visualization. You can either choose to download the program or just run it in your webbrowser.
Danger
Module is still under development
Image credit"},{"location":"s10_extra/pipeline/#dags","title":"DAGs","text":"Directed Acyclic Graph (DAG)
"},{"location":"s10_extra/pipeline/#exercises","title":"\u2754 Exercises","text":"Start by installing prefect
:
pip install prefect\n
Start a local Prefect server instance in your virtual environment.
prefect server start\n
The great thing about Prefect is that the orchestration tasks and flows are written in pure Python.
Slides
Today we start our journey into the world of machine learning operations (MLOps). However, before we can really get started, we need to make sure that you have a basic understanding of a couple of topics, as we will be using these throughout the course. In particular, today is all about getting set up with a proper development environment that can support your journey. Most of you probably already have experience with these topics, and it will be mostly repetition.
The reason we are starting here is that many students are missing very basic skills that are never taught but are just expected to be picked up by yourself. This session will only cover the most basic skills to get you started on your development journey. If you wish to learn more about basic computer science skills in general, we highly recommend that you check out The Missing Semester of Your CS Education course from MIT.
Learning objectives
The learning objectives of this session are:
Core Module
Image creditContrary to popular belief, the command line (also commonly known as the terminal) is not a mythical being that has existed since the dawn of time. Instead, it was created at a time when it was not given that your computer had a graphical interface that you could interact with. Think of it as a text interface to your computer.
The terminal is a well-known concept to users of Linux, however, MAC and (especially) Windows users often do not need and therefore encounter it. Having a basic understanding of how to use a command line can help improve your workflow. The reason that the command line is an important tool to get to know, is that doing any kind of MLOps will require us to be able to interact with many different tools, many of which do not have a graphical interface. Additionally, when we get to working in the cloud later in the course, you will be forced to interact with the command line.
Note if you already are a terminal wizard then feel free to skip the exercises below. They are very elementary.
"},{"location":"s1_development_environment/command_line/#the-anatomy-of-the-command-line","title":"The anatomy of the command line","text":"Regardless of the operating system, all command lines look more or less the same:
As already stated, it is essentially just a big text interface to interact with your computer. As the image illustrates, when trying to execute a command, there are several parts to it:
$
, >
, :
are the usual ones. It can also contain other information, such as in the case of the above image which also shows the current conda
environment.ls
or cd
ls -l
or cd ..
.ls -l figures
or cd ..
.The core difference between options and arguments is that options are optional, while arguments are not.
Image credit"},{"location":"s1_development_environment/command_line/#exercises","title":"\u2754 Exercises","text":"We have put a cheat sheet in the exercise files folder belonging to this session, that gives a quick overview of the different commands that can be executed in the command line.
Windows usersWe highly recommend that you install Windows Subsystem for Linux (WSL). This will install a full Linux system on your Windows machine. Please follow this guide. Remember to run commands from an elevated (as administrator) Windows Command Prompt. You can in general complete all exercises in the course from a normal Windows Command Prompt, but some are easier to do if you run from WSL.
If you decide to run in WSL you need to remember that you now have two different systems, and installing a package on one system does not mean that it is installed on the other. For example, if you install pip
in WSL, you need to install it again in Windows if you want to use it there.
If you decide to not run in WSL, please always work in a Windows Command Prompt and not Powershell.
Start by opening a terminal.
To navigate inside a terminal, we rely on the cd
command and pwd
command. Make sure you know how to go back and forth in your file system. (1)
The ls
command is important when we want to know the content of a folder. Try to use the command, and also try it with the additional option -l
. What does it show?
Make sure to familiarize yourself with the which
, echo
, cat
, wget
, less
and top
commands. Also, familiarize yourself with the >
operator. You are probably going to use some of them throughout the course or in your future career. For Windows users, these commands may be named something else, e.g. where
command on Windows corresponds to which
.
It is also significant that you know how to edit a file through the terminal. Most systems should have the nano
editor installed, else try to figure out which one is installed in your system.
Type nano
in the terminal
Write the following text in the script
if __name__ == \"__main__\":\n print(\"Hello world!\")\n
Save the script and try to execute it
Afterward, try to edit the file through the terminal (change Hello world
to something else)
All terminals come with their own programming language. The most common system is called bash
. It can come in handy being able to write simple programs in bash. For example, one case is that you want to execute multiple Python programs sequentially, which can be done through a bash script.
Bash is not part of Windows, so you need to run this part through WSL. If you did not install WSL, you can skip this part or as an alternative do the exercises in Powershell which is the native Windows scripting language (not recommended).
Write a bash script (in nano
) and try executing it:
#!/bin/bash\n# A sample Bash script, by Ryan\necho Hello World!\n
Change the bash script to call the Python program you just wrote.
Try to Google how to write a simple for-loop that executes the Python script 10 times in a row.
Here is one command from later in the course when we are going to work in the cloud
gcloud compute instances create-with-container instance-1 \\\n --container-image=gcr.io/<project-id>/gcp_vm_tester\n --zone=europe-west1-b\n
Identify the command, options and arguments.
Solutiongcloud compute instances create-with-container
.--container-image=gcr.io/<project-id>/gcp_vm_tester
and --zone=europe-west1-b
.instance-1
.The tricky part of this example is that commands can have subcommands, which are also commands. In this case compute
is a subcommand to gcloud
, instances
is a subcommand to compute
and create-with-container
is a subcommand to instances
Two common arguments that nearly all commands have are the -h
and -V
options. What does each of them do?
The -h
(or --help
) option prints the help message for the command, including subcommands and arguments. Try it out by executing python -h
. The -V
(or --version
) option prints the version of the installed program. Try it out by executing python --version
.
This ends the module on the command line. If you are still not comfortable working with the command line, fear not as we are going to use it extensively throughout the course. If you want to spend additional time on this topic, we highly recommend that you watch this video on how to use the command line.
If you are interested in personalizing your command line, you can check out the starship project, which allows you to customize your command line with a lot of different options.
"},{"location":"s1_development_environment/deep_learning_software/","title":"M4 - Deep Learning Software","text":""},{"location":"s1_development_environment/deep_learning_software/#deep-learning-software","title":"Deep Learning Software","text":"Core Module
Deep learning has since its revolution back in 2012 transformed our lives. From Google Translate to driverless cars to personal assistants to protein engineering, deep learning is transforming nearly every sector of our economy and our lives. However, it did not take long before people realized that deep learning is not a simple beast to tame and it comes with its own kinds of problems, especially if you want to use it in a production setting. In particular the concept of technical debt was invented to indicate the significant maintenance costs at a system level that it takes to run machine learning in production. MLOps should very much be seen as the response to the concept of technical debt, namely that we should develop methods, processes and tools (with inspiration from classical DevOps) to counter the problems we run into when working with deep learning models.
It is important to note that all the concepts and tools that have been developed for MLOps can absolutely be used together with more classical machine learning models (think K-nearest neighbor, Random forest etc.), however deep learning comes with its own set of problems which mostly have to do with the sheer size of the data and models we are working with. For these reasons, we are focusing on working with deep learning models in this course.
"},{"location":"s1_development_environment/deep_learning_software/#software-landscape-for-deep-learning","title":"Software landscape for Deep Learning","text":"Regarding software for Deep Learning, the landscape is currently dominated by three software frameworks (listed in order of when they were published):
Tensorflow
Pytorch
JAX
We won't go into a longer discussion on which framework is best, as it is pointless. Pytorch and Tensorflow have been around for the longest and therefore have bigger communities and feature sets at this point in time. They are both very similar in the sense that they both have features directed against research and production. JAX is kind of the new kid on the block, which in many ways improves on Pytorch and Tensorflow, but is still not as mature as the other frameworks. As the frameworks use different kind of programming principles (object oriented vs. functional programming), comparing them is essentially meaningless.
In this course we have chosen to work with Pytorch, because we find it a bit more intuitive and it is the framework that we use for our day to day research life. Additionally, as of right now it is absolutely the dominating framework for published models, research papers and competition winners
The intention behind this set of exercises is to bring everyone's Pytorch skills up-to-date. If you already are a Pytorch-Jedi feel free to pass the first set of exercises, but I recommend that you still complete it. The exercises are in large part taken directly from the deep learning course at udacity. Note that these exercises are given as notebooks, which is the last time we are going to use them actively in course. Instead, after this set of exercises, we are going to focus on writing code in python scripts.
The notebooks contains a lot of explaining text. The exercises that you are supposed to fill out are inlined in the text in small \"exercise\" blocks:
If you need a fresh-up on any deep learning topic in general throughout the course, we recommend to find the relevant chapter in the deep learning book by Ian Goodfellow, Yoshua Bengio and Aaron Courville (can also be found in the literature folder). It is absolutely not necessary to be good at deep learning to pass this course as the focus is on all the software needed to get deep learning models into production. However, it is important to have a basic understanding of the concepts.
"},{"location":"s1_development_environment/deep_learning_software/#exercises","title":"\u2754 Exercises","text":"Exercise files
Start a jupyter notebook session in your terminal (assuming you are standing in the root of the course material). Alternatively you should be able to open the notebooks directly in your code editor. For VS code users you can read more about how to work with jupyter notebooks in VS code here
Complete the Tensors in Pytorch notebook. It focuses on basic manipulation of Pytorch tensors. You can pass this notebook if you are comfortable doing this.
Complete the Neural Networks in Pytorch notebook. It focuses on building a very simple neural network using the Pytorch nn.Module
interface.
Complete the Training Neural Networks notebook. It focuses on how to write a simple training loop for training a neural network.
Complete the Fashion MNIST notebook, that summaries concepts learned in the notebook 2 and 3 on building a neural network for classifying the Fashion MNIST dataset.
Complete the Inference and Validation notebook. This notebook adds important concepts on how to do inference and validation on our neural network.
Complete the Saving_and_Loading_Models notebook. This notebook addresses how to save and load model weights. This is important if you want to share a model with someone else.
If tensor a
has shape [N, d]
and tensor b
has shape [M, d]
how can we calculate the pairwise distance between rows in a
and b
without using a for loop?
We can take advantage of broadcasting to do this
a = torch.randn(N, d)\nb = torch.randn(M, d)\ndist = torch.sum((a.unsqueeze(1) - b.unsqueeze(0))**2, dim=2) # shape [N, M]\n
What should be the size of S
for an input image of size 1x28x28, and how many parameters does the neural network then have?
from torch import nn\nneural_net = nn.Sequential(\n nn.Conv2d(1, 32, 3), nn.ReLU(), nn.Conv2d(32, 64, 3), nn.ReLU(), nn.Flatten(), nn.Linear(S, 10)\n)\n
Solution Since both convolutions have a kernel size of 3, stride 1 (the default value) and no padding, we lose 2 pixels in each spatial dimension, because the kernel cannot be centered on the edge pixels. Therefore, the output of the first convolution would be 32x26x26. The output of the second convolution would be 64x24x24. The size of S
must therefore be 64 * 24 * 24 = 36864
. The number of parameters in a convolutional layer is kernel_size * kernel_size * in_channels * out_channels + out_channels
(last term is the bias) and the number of parameters in a linear layer is in_features * out_features + out_features
(last term is the bias). Therefore, the total number of parameters in the network is 3*3*1*32 + 32 + 3*3*32*64 + 64 + 36864*10 + 10 = 387,466
, which could be calculated by running:
from math import prod\nsum([prod(p.shape) for p in neural_net.parameters()])  # 387466\n
A working training loop in Pytorch should have these three function calls: optimizer.zero_grad()
, loss.backward()
, optimizer.step()
. Explain what would happen in the training loop (or implement it) if you forgot each of the function calls.
optimizer.zero_grad()
is in charge of zeroing the gradients. If this is not done, then gradients would accumulate over the steps, leading to exploding gradients. loss.backward()
is in charge of calculating the gradients. If this is not done, then the gradients would not be calculated and the optimizer would not be able to update the weights. optimizer.step()
is in charge of updating the weights. If this is not done, then the weights would not be updated and the model would not learn anything.
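To make the role of the three calls concrete, here is a minimal sketch of a training loop, assuming that model and train_dataloader are already defined:
from torch import nn, optim\n\ncriterion = nn.CrossEntropyLoss()\noptimizer = optim.Adam(model.parameters(), lr=1e-3)\nfor images, labels in train_dataloader:\n    optimizer.zero_grad()  # reset gradients from the previous iteration\n    loss = criterion(model(images), labels)\n    loss.backward()  # compute gradients of the loss with respect to the weights\n    optimizer.step()  # update the weights using the computed gradients\n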
As the final exercise we will develop a simple baseline model that we will continue to build on during the course. For this exercise we provide the data in the data/corruptmnist
folder. Do NOT use the data in the corruptmnist_v2
folder as that is intended for another exercise. As the name suggests, this is a (subsampled) corrupted version of regular MNIST. Your overall task is the following:
Implement an MNIST neural network that achieves at least 85% accuracy on the test set.
Before any training can start, you should identify what corruption we have applied to the MNIST dataset to create the corrupted version. This can help you identify what kind of neural network to use to get good performance, but any network should really be able to achieve this.
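A quick, hedged sketch for eyeballing a few of the images with matplotlib (the file path and file names are assumptions and depend on how the data is stored on your machine):
import matplotlib.pyplot as plt\nimport torch\n\nimages = torch.load('data/corruptmnist/train_images_0.pt')  # example path, adjust to your setup\nfig, axes = plt.subplots(1, 5)\nfor ax, img in zip(axes, images[:5]):\n    ax.imshow(img.squeeze(), cmap='gray')  # visual inspection should reveal the corruption\n    ax.axis('off')\nplt.show()\n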
One key point of this course is trying to stay organized. Spending time now organizing your code will save time in the future as you start to add more and more features. As subgoals, please fulfill the following exercises
Implement your model in a script called model.py
Implement your data setup in a script called data.py
. The data was saved using torch.save
, so to load it you should use torch.load
.
Saving the model
When saving the model, you should use torch.save(model.state_dict(), \"model.pt\")
and when loading the model you should use model.load_state_dict(torch.load(\"model.pt\"))
. If you do torch.save(model, \"model.pt\")
this can lead to problems when loading the model later on, as it will try to save not only the model weights but also the model definition. That becomes a problem if you change the model definition later on (which you most likely are going to do).
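A minimal sketch of the recommended pattern, where MyModel is a hypothetical model class defined in your own model.py:
import torch\nfrom model import MyModel  # hypothetical class from your own model.py\n\nmodel = MyModel()\ntorch.save(model.state_dict(), 'model.pt')  # save only the weights\n\nnew_model = MyModel()  # the architecture is re-created from code\nnew_model.load_state_dict(torch.load('model.pt'))\n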
Implement training and evaluation of your model in main.py
script. The main.py
script should be able to take additional subcommands indicating whether the model should train or evaluate. It will look something like this:
python main.py train --lr 1e-4\npython main.py evaluate trained_model.pt\n
which can be implemented in various ways.
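As an illustration, one possible way to support the two subcommands is with argparse subparsers from the standard library; this is only a sketch (the provided starter code may solve it differently) and the train and evaluate functions are assumed to be implemented elsewhere in the script:
import argparse\n\ndef cli() -> None:\n    parser = argparse.ArgumentParser(description='Train or evaluate the model')\n    subparsers = parser.add_subparsers(dest='command', required=True)\n    train_parser = subparsers.add_parser('train')\n    train_parser.add_argument('--lr', type=float, default=1e-4)\n    eval_parser = subparsers.add_parser('evaluate')\n    eval_parser.add_argument('model_checkpoint')\n    args = parser.parse_args()\n    if args.command == 'train':\n        train(args.lr)  # assumed to be implemented elsewhere in main.py\n    else:\n        evaluate(args.model_checkpoint)  # assumed to be implemented elsewhere\n\nif __name__ == '__main__':\n    cli()\n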
VS code and command line arguments: If you try to execute the above code in VS code using the debugger (F5) or the built-in run functionality in the upper right corner:
you will get an error message saying that you need to select a command to run e.g. main.py
either needs the train
or evaluate
command. This can be fixed by adding a launch.json
to a specialized .vscode
folder in the root of the project. The launch.json
file should look something like this:
{\n \"version\": \"0.2.0\",\n \"configurations\": [\n {\n \"name\": \"Python: Current File\",\n \"type\": \"python\",\n \"request\": \"launch\",\n \"program\": \"${file}\",\n \"args\": [\n \"train\",\n \"--lr\",\n \"1e-4\"\n ],\n \"console\": \"integratedTerminal\",\n \"justMyCode\": true\n }\n ]\n}\n
This will inform VS code that when we execute the current file (in this case main.py
) we want to run it with the train
command and additionally pass the --lr
argument with the value 1e-4
. You can read more about creating a launch.json
file here. If you want to have multiple configurations you can add them to the configurations
list as additional dictionaries.
To start you off, a very basic version of each script is provided in the final_exercise
folder. We have already implemented some logic, especially to make sure you can easily run different subcommands for step 4. If you are interested in how this is done you can check out this optional module on defining command line interfaces (CLI). We additionally provide a requirements.txt
with suggestions for what packages are necessary to complete the exercise.
As documentation that your model is actually working, when running the train
command the script needs to produce a single plot with the training curve (training step vs training loss). When the evaluate
command is run, it should write the test set accuracy to the terminal.
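A minimal sketch of how the training curve could be produced, assuming train_losses is a list of loss values collected during training (the output file name is just an example):
import matplotlib.pyplot as plt\n\ndef save_training_curve(train_losses: list, out_path: str = 'training_curve.png') -> None:\n    # plot the training loss per step and save the figure to disk\n    plt.figure()\n    plt.plot(train_losses)\n    plt.xlabel('Training step')\n    plt.ylabel('Training loss')\n    plt.savefig(out_path)\n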
It is part of the exercise not to implement this in notebooks, as code development in real life happens in scripts. As the model is simple to run (for now) you should be able to complete the exercise on your laptop, even if you are only training on CPU. That said, you are allowed to upload your scripts to your own \"Google Drive\" and then call your scripts from a Google Colab notebook, which is shown in the image below where all code is placed in the fashion_trainer.py
script and the Colab notebook is just used to execute it.
Be sure to have completed the final exercise before the next session, as we will be building on top of the model you have created.
"},{"location":"s1_development_environment/editor/","title":"M3 - Editor","text":""},{"location":"s1_development_environment/editor/#editoride","title":"Editor/IDE","text":"Core Module
Notebooks can be great for testing out ideas, developing simple code and explaining and visualizing certain aspects of a codebase. Remember that Jupyter notebook was created with the intention to \"...allows you to create and share documents that contain live code, equations, visualizations and narrative text.\" However, any larger machine learning project will require you to work in multiple .py
files, and here notebooks will provide a suboptimal workflow. Therefore, to truly get \"work done\" you will need a good editor / IDE.
Many opinions exist on this matter, but for simplicity we recommend getting started with one of the following 3:
Editor Webpage Comment (Biased opinion) Spyder https://www.spyder-ide.org/ Matlab-like environment that is easy to get started with Visual studio code https://code.visualstudio.com/ Support for multiple languages with fairly easy setup PyCharm https://www.jetbrains.com/pycharm/ IDE for python professionals. Will take a bit of time getting used to. We highly recommend Visual studio (VS) code if you do not already have an editor installed (or just want to try something new). We therefore put additional effort into explaining VS code.
Below you see an overview of the VS code interface
Image credit The main components of VS code are:
The action bar: VS code is not an editor meant for a single language and can do many things. One of the core reasons that VS code has become so popular is that custom plug-ins called extensions can be installed to add functionality to VS code. It is in the action bar that you can navigate between these different applications when you have installed them.
The side bar: The side bar has different functionality depending on what extension you have open. In most cases, the side bar will just contain the file explorer.
The editor: This is where your code is. VS code supports a number of layouts in the editor (one column, two columns, etc.). You can make a custom layout by dragging a file to where you want the layout to split.
The panel: The panel contains a terminal for you to interact with. This can quickly be used to try out code by opening a python
interpreter, managing environments, etc.
The status bar: The status bar contains information based on the extensions that you have installed. In particular for python development, the status bar can be used to change conda environment.
The overall goal of the exercises is that you should start familiarizing yourself with the editor that you have chosen. If you are already an expert in one of them, feel free to skip the rest. You should at least be able to:
The instructions below are specific to Visual studio code but we recommend that you try to answer the questions if using another editor. In the exercise_files
folder belonging to this session we have put cheat sheets for VS code (one for Windows and one for Mac/Linux), which can give you an easy overview of the different shortcuts in VS code. The following exercises are just to get you started but you can find many more tutorials here.
VS code is a general editor for many languages and to get proper python support we need to install some extensions. In the action bar
go to the extension
tab and search for python
in the marketplace. From here we highly recommend installing the following packages:
If you install the Python
package you should see something like this in your status bar:
which indicates that you are using the stock python installation, instead of the one you have created using conda
. Click it and change the python environment to the one you actually want to use.
One of the most useful tools in VS Code is the ability to navigate the whole project using the built-in Explorer
. To really take advantage of VS code you need to make sure that what you are working on is a project. Create a folder called hello
(somewhere on your laptop) and open it in VS Code (Click File
in the menu and then select Open Folder
). You should end up with a completely clean workspace (as shown below). Click the New file
button and create a file called hello.py
.
Image credit
Finally, let's run some code. Add something simple to the hello.py
file like:
Image credit
and click the run
button as shown in the image. It should create a new terminal, activate the environment that you have chosen and finally run your script. In addition to clicking the run
button, you can also
Shift+Enter
to run it in the terminal. That's the basics of using VS code. We highly recommend that you revisit this tutorial during the course when we get to topics such as debugging and version control, which VS code can help with.
"},{"location":"s1_development_environment/editor/#a-note-on-jupyter-notebooks-in-production-environments","title":"A note on jupyter notebooks in production environments","text":"As already stated jupyter notebooks are great for development as they allow developers to easily test our new ideas. However, they often lead to pain points when models actually need to be deployed. We highly recommend reading section 5.1.1 of this paper by Shankar et al. that in more detail discuss the strong opinions to jupyter notebooks that exist within the developer community.
All this said, there exists at least one simple tool to make notebooks work better in a production setting. It's called nbconvert
and can be installed with
conda install nbconvert # or pip install nbconvert\n
You may need some further dependencies such as Pandoc, TeX and Pyppeteer for it to work (see install instructions here). After this, converting a notebook to a .py
script is as simple as:
jupyter nbconvert --to=script my_notebook.ipynb\n
which will produce a similarly named script called my_notebook.py
. We highly recommend that you stick to developing scripts directly during the course to get experience with doing so, but nbconvert
can be a fantastic tool to have in your toolbox.
Core Module
Python is a great programming language and this is mostly due to its vast ecosystem of packages. No matter what you want to do, there is probably a package that can get you started. Just try to remember the last time you wrote a program using only the Python standard library. Probably never. For this reason, we need a way to install third-party packages and this is where package managers come into play.
You have probably already used pip
for the longest time, which is the default package manager for Python. pip
is great for beginners but it is missing one essential feature that you will need as a developer or data scientist: virtual environments. Virtual environments are an essential way to make sure that the dependencies of different projects do not cross-contaminate each other. As a naive example, consider project A that requires torch==1.3.0
and project B that requires torch==2.0
, then doing
cd project_A # move to project A\npip install torch==1.3.0 # install old torch version\ncd ../project_B # move to project B\npip install torch==2.0 # install new torch version\ncd ../project_A # move back to project A\npython main.py # try executing main script from project A\n
will mean that even though we are executing the main script from project A's folder, it will use torch==2.0
instead of torch==1.3.0
because that is the last version we installed. This happens because in both cases pip
will install the package into the same environment, in this case the global environment. Instead, if we did something like:
cd project_A # move to project A\npython -m venv env # create a virtual environment in project A\nsource env/bin/activate # activate that virtual environment\npip install torch==1.3.0 # install old torch version into the virtual environment belonging to project A\ncd ../project_B # move to project B\npython -m venv env # create a virtual environment in project B\nsource env/bin/activate # activate that virtual environment\npip install torch==2.0 # install new torch version into the virtual environment belonging to project B\ncd ../project_A # move back to project A\nsource env/bin/activate # activate the virtual environment belonging to project A\npython main.py # succeed in executing main script from project A\n
cd project_A # move to project A\npython -m venv env # create a virtual environment in project A\n.\\env\\Scripts\\activate # activate that virtual environment\npip install torch==1.3.0 # install old torch version into the virtual environment belonging to project A\ncd ../project_B # move to project B\npython -m venv env # create a virtual environment in project B\n.\\env\\Scripts\\activate # activate that virtual environment\npip install torch==2.0 # install new torch version into the virtual environment belonging to project B\ncd ../project_A # move back to project A\n.\\env\\Scripts\\activate # activate the virtual environment belonging to project A\npython main.py # succeed in executing main script from project A\n
then we would be sure that torch==1.3.0
is used when executing main.py
in project A because we are using two different virtual environments. In the above case, we used the venv module which is the built-in Python module for creating virtual environments. venv+pip
is arguably a good combination but when working on multiple projects it can quickly become a hassle to manage all the different virtual environments yourself, remembering which Python version to use, which packages to install and so on.
For this reason, a number of package managers have been created that can help you manage your virtual environments and dependencies, with some of the most popular being:
with more being created every year (rye is looking like an interesting project). This is considered a problem in the Python community, because it means that there is no standard way of managing dependencies like in other languages like npm
for node.js
or cargo
for rust
.
In the course, we do not care about which package manager you use, but we do care that you use one. If you are already familiar with one package manager, then skip this exercise and continue to use that. The best recommendation that I can give regarding package managers, in general, is to find one you like and then stick with it. A lot of time can be wasted on trying to find the perfect package manager, but in the end, they all do the same with some minor differences. Checkout this blog post if you want a fairly up-to-date evaluation of the different environment management and packaging tools that exist in the Python ecosystem.
If you are not familiar with any package managers, then we recommend that you use conda
and pip
for this course. You probably already have conda installed on your laptop, which is great. What conda does especially well, is that it allows you to create virtual environments with different Python versions, which can be really useful if you encounter dependencies that have not been updated in a long time. In this course specifically, we are going to recommend the following workflow
conda
to create virtual environments with specific Python versionspip
to install packages in that environmentInstalling packages with pip
inside conda
environments has been considered a bad practice for a long time, but since conda>=4.6
it is considered safe to do so. The reason for this is that conda
now has a built-in compatibility layer that makes sure that pip
installed packages are compatible with the other packages installed in the environment.
Before we get started with the exercises, let's first talk a bit about Python dependencies. One of the most common ways to specify dependencies in the Python community is through a requirements.txt
file, which is a simple text file that contains a list of all the packages that you want to install. The format allows you to specify the package name and version number you want, with 7 different operators:
package1 # any version\npackage2 == x.y.z # exact version\npackage3 >= x.y.z # at least version x.y.z\npackage4 > x.y.z # newer than version x.y.z\npackage4 <= x.y.z # at most version x.y.z\npackage5 < x.y.z # older than version x.y.z\npackage6 ~= x.y.z # install version newer than x.y.z and older than x.y+1\n
In general, all packages (should) follow the semantic versioning standard, which means that the version number is split into three parts: x.y.z
where x
is the major version, y
is the minor version and z
is the patch version.
The reason that we need to specify the version number is that we want to make sure that we can reproduce our code at a later point. If we do not specify the version number, then we are at the mercy of the package maintainer to not change the API of the package. This is especially important when working with machine learning models, as we want to make sure that we can reproduce the exact same model at a later point.
Finally, we also need to discuss dependency resolution, which is the process of figuring out which packages are compatible. This is a non-trivial problem, and there exist a lot of different algorithms for doing this. If you have ever thought that pip
and conda
were taking a long time to install something, then it is probably because they were trying to figure out which packages are compatible with each other. For example, if you try to install
pip install \"matplotlib >= 3.8.0\" \"numpy <= 1.19\" --dry-run\n
then it would simply fail because there are no versions of matplotlib
and numpy
under the given constraints that are compatible with each other. In this case, we would need to relax the constraints to something like
pip install \"matplotlib >= 3.8.0\" \"numpy <= 1.21\" --dry-run\n
to make it work.
"},{"location":"s1_development_environment/package_manager/#exercises","title":"\u2754 Exercises","text":"For hints regarding how to use conda
you can check out the cheat sheet in the exercise folder.
Download and install conda
. You are free to either install full conda
or the much simpler version miniconda
. The core difference between the two packages is that conda
already comes with a lot of packages that you would normally have to install with miniconda
. The downside is that conda
is a much larger package which can be a huge disadvantage on smaller devices. Make sure that your installation is working by writing conda help
in a terminal and it should show you the help message for conda. If this does not work you probably need to set some system variable to point to the conda installation
If you have successfully installed conda, then you should be able to execute the conda
command in a terminal.
Conda will always tell you what environment you are currently in, indicated by the (env_name)
in the prompt. By default it will always start in the (base)
environment.
Try creating a new virtual environment. Make sure that it is called my_enviroment
and that it installs version 3.11 of Python. What command should you execute to do this?
We highly recommend that you use Python 3.8 or higher for this course. In general, we recommend that you use the second latest version of Python that is available (currently Python 3.11 as of writing this). This is because the latest version of Python is often not supported by all dependencies. You can always check the status of different Python version support here.
Which conda
command gives you a list of all the environments that you have created?
Which conda
command gives you a list of the packages installed in the current environment?
How do you easily export this list to a text file? Do this, and make sure you export it to a file called enviroment.yaml
, as conda uses another format by default than pip
.
Inspect the file to see what is in it.
The enviroment.yaml
file you have created is one way to secure reproducibility between users because anyone should be able to get an exact copy of your environment if they have your enviroment.yaml
file. Try creating a new environment directly from your enviroment.yaml
file and check that the packages being installed exactly match what you originally had.
As the introduction states, it is fairly safe to use pip
inside conda
today. What is the corresponding pip
command that gives you a list of all pip
installed packages? And how do you export this to requirements.txt
file?
If you look through the requirements that both pip
and conda
produce then you will see that it is often filled with a lot more packages than what you are actually using in your project. What you are really interested in are the packages that you import in your code: from package import module
. One way to get around this is to use the package pipreqs
, which will automatically scan your project and create a requirements file specific to that. Let's try it out:
Install pipreqs
:
pip install pipreqs\n
Either try out pipreqs
on one of your own projects or try it out on some other online project. What does the requirements.txt
file pipreqs
produces look like compared to the files produced by either pip
or conda
.
Try executing the command
pip install \"pytest < 4.6\" pytest-cov==2.12.1\n
based on the error message you get, what would be a compatible way to install these?
Solution As pytest-cov==2.12.1
requires a version of pytest
newer than 4.6
, we can simply change the command to be:
pip install \"pytest >= 4.6\" pytest-cov==2.12.1\n
but there of course exists other solutions as well.
This ends the module on setting up virtual environments. While the methods mentioned in the exercises are great ways to construct requirements files automatically, sometimes it is just easier to manually sit down and create the files, as that way you ensure that only the most necessary requirements are installed when creating a new environment.
"},{"location":"s2_organisation_and_version_control/","title":"Getting started with MLOps - Organization and version control","text":"Slides
Today we take our first steps into the world of MLOps. The set of modules in this session focuses on getting organized and making sure that you are familiar with good development practices. While many of the practices you will learn about in these modules do not seem that important when you are a single person working on a project, it is crucial when working in large groups that the difference in how different people organize and write their code is minimized. The topics in this session will focus on:
Some exercises in this course are very loosely stated (including the exercises today). You are expected to seek out information before you ask for help (Google is your friend!), as you will learn more from trying to solve the problems yourself, and it is more realistic of how the \"real world\" works.
Learning objectives
The learning objectives of this session are:
git
to track changes to your codedvc
to version control dataCore Module
With a basic understanding of version control, it is now time to really begin filling up our code repository. However, the question then remains how to organize our code? As developers we tend to not think about code organization that much. It is instead something that just dynamically is being created as we may need it. However, maybe we should spend some time initially getting organized with the chance of this making our code easier to develop and maintain in the long run. If we do not spend time organizing our code, we may end up with a mess of code that is hard to understand or maintain
Big ball of Mud
A Big Ball of Mud is a haphazardly structured, sprawling, sloppy, duct-tape-and-baling-wire, spaghetti-code jungle. These systems show unmistakable signs of unregulated growth, and repeated, expedient repair. Information is shared promiscuously among distant elements of the system, often to the point where nearly all the important information becomes global or duplicated. The overall structure of the system may never have been well defined. If it was, it may have eroded beyond recognition. Programmers with a shred of architectural sensibility shun these quagmires. Only those who are unconcerned about architecture, and, perhaps, are comfortable with the inertia of the day-to-day chore of patching the holes in these failing dikes, are content to work on such systems. Brian Foote and Joseph Yoder, Big Ball of Mud. Fourth Conference on Patterns Languages of Programs (PLoP '97/EuroPLoP '97) Monticello, Illinois, September 1997
We are here going to focus on the organization of data science projects and machine learning projects. The core difference this kind of projects introduces compared to more traditional systems, is data. The key to modern machine learning is without a doubt the vast amounts of data that we have access to today. It is therefore not unreasonable that data should influence our choice of code structure. If we had another kind of application, then the layout of our codebase should probably be different.
"},{"location":"s2_organisation_and_version_control/code_structure/#cookiecutter","title":"Cookiecutter","text":"We are in this course going to use the tool cookiecutter, which is tool for creating projects from project templates. A project template is in short just na overall structure of how you want your folders, files etc. to be organised from the beginning. For this course we are going to be using a custom MLOps template. The template is essentially a fork of the cookiecutter data science template template that has been used for a couple of years in the course, but specialized a bit more towards MLOps instead of general data science.
We are not going to argue that this template is better than every other template, we are just focusing on that it is a standardized way of creating project structures for machine learning projects. By standardized we mean, that if two persons are both using cookiecutter
with the same template, the layout of their code does follow some specific rules, enabling one to faster understand the other person's code. Code organization is therefore not only to make the code easier for you to maintain but also for others to read and understand.
Below is seen the default code structure of cookiecutter for data science projects.
What is important to keep in mind when using a template, is that it exactly is a template. By definition a template is guide to make something. Therefore, not all parts of an template may be important for your project at hand. Your job is to pick the parts from the template that is useful for organizing your machine learning project and add the parts that are missing.
"},{"location":"s2_organisation_and_version_control/code_structure/#python-projects","title":"Python projects","text":"While the same template in principal could be used regardless of what language we were using for our machine learning or data science application, there are certain considerations to take into account based on what language we are using. Python is the dominant language for machine learning and data science currently, which is why we in this section are focusing on some of the special files you will need for your Python projects.
The first file you may or may not know is the __init__.py
file. In Python the __init__.py
file is used to mark a directory as a Python package. Therefore as a bare minimum, any Python package should look something like this:
\u251c\u2500\u2500 src/\n\u2502 \u251c\u2500\u2500 __init__.py\n\u2502 \u251c\u2500\u2500 file1.py\n\u2502 \u251c\u2500\u2500 file2.py\n\u251c\u2500\u2500 pyproject.toml\n
The second file to focus on is the pyproject.toml
. This file is important for actually converting your code into a Python project. Essentially, whenever you run pip install
, pip
is in charge of both downloading the package you want but also in charge of installing it. For pip
to be able to install a package it needs instructions on what part of the code it should install and how to install it. This is the job of the pyproject.toml
file.
Below we have both added a description of the structure of the pyproject.toml
file but also setup.py + setup.cfg
which is the \"old\" way of providing project instructions regarding Python project. However, you may still encounter a lot of projects using setup.py + setup.cfg
so it is good to at least know about them.
pyproject.toml
is the new standardized way of describing project metadata in a declaratively way, introduced in PEP 621. It is written toml format which is easy to read. At the very least your pyproject.toml
file should include the [build-system]
and [project]
sections:
[build-system]\nrequires = [\"setuptools\", \"wheel\"]\nbuild-backend = \"setuptools.build_meta\"\n\n[project]\nname = \"my-package-name\"\nversion = \"0.1.0\"\nauthors = [{name = \"EM\", email = \"me@em.com\"}]\ndescription = \"Something cool here.\"\nrequires-python = \">=3.8\"\ndynamic = [\"dependencies\"]\n\n[tool.setuptools.dynamic]\ndependencies = {file = [\"requirements.txt\"]}\n
the [build-system]
informs pip
/python
that to build this Python project it needs the two packages setuptools
and wheels
and that it should call the setuptools.build_meta function to actually build the project. The [project]
section essentially contains metadata regarding the package, what its called etc. if we ever want to publish it to PyPI.
For specifying dependencies of your project you have two options. Either you specify them in a requirements.txt
file and it as a dynamic field in pyproject.toml
as shown above. Alternatively, you can add a dependencies
field under the [project]
header like this:
[project]\ndependencies = [\n 'torch==2.1.0',\n 'matplotlib>=3.8.1'\n]\n
The improvement over setup.py + setup.cfg
is that pyproject.toml
also allows for metadata from other tools to be specified in it, essentially making sure you only need a single file for your project. For example, in the next [module M7 on good coding practices] you will learn about the tool ruff
and how it can help format your code. If we want to configure ruff
for our project we can do that directly in pyproject.toml
by adding additional headers:
[tool.ruff]\nruff_option = ...\n
To read more about how to specify pyproject.toml
this page is a good place to start.
setup.py
is the original way of describing how a Python package should be built. The most basic setup.py
file will look like this:
from setuptools import setup\n\nwith open(\"requirements.txt\") as f:  # read the dependencies from requirements.txt\n    requirements = f.read().splitlines()\n\nsetup(\n    name=\"my-package-name\",\n    version=\"0.1.0\",\n    author=\"EM\",\n    description=\"Something cool here.\",\n    install_requires=requirements,\n)\n
Essentially, it is the exact same meta information as in pyproject.toml
, just written directly in Python syntax instead of toml
. Because there was a wish to separate this meta information into a separate file, the setup.cfg
file was created which can contain the exact same information as setup.py
just in a declarative config.
[metadata]\nname = my-package-name\nversion = 0.1.0\nauthor = EM\ndescription = \"Something cool here.\"\n# ...\n
This non-standardized way of providing meta information regarding a package was essentially what led to the creation of pyproject.toml
.
Regardless of what way a project is configured, after creating the above files the correct way to install them would be the same
pip install .\n# or in developer mode\npip install -e . # (1)!\n
-e
is short for --editable
mode also called developer mode. Since we will continuously iterating on our package this is the preferred way to install our package, because that means that we do not have to run pip install
every time we make a change. Essentially, in developer mode changes in the Python source code can immediately take place without requiring a new installation.after running this your code should be available to import as from project_name import ...
like any other Python package you use. This is the most essential you need to know about creating Python packages.
After having installed cookiecutter (exercise 1 and 2), the remaining exercises are intended to be used on taking the simple CNN MNIST classifier from yesterdays exercise and force it into this structure. You are not required to fill out every folder and file in the project structure, but try to at least follow the steps in exercises. Whenever you need to run a file I recommend always doing this from the root directory e.g.
python <project_name>/data/make_dataset.py data/raw data/processed\npython <project_name>/models/train_model.py <arguments>\netc...\n
in this way paths (for saving and loading files) are always relative to the root.
Install cookiecutter framework
pip install cookiecutter\n
Start a new project using this template, that is specialized for this course (1).
You do this by running the cookiecutter command using the template url:
cookiecutter <url-to-template>\n
Valid project names
When asked for a project name you should follow the PEP8 guidelines for naming packages. This means that the name should be all lowercase and if you want to separate words, you should use underscores. For example my_project
is a valid name, while MyProject
is not. Additionally, the packaage name cannot start with a number.
There are two common choices on how layout your source directory. The first is called src-layout where the source code is always place in a src/<project_name>
folder and the second is called flat-layout where the source code is place is just placed in a <project_name>
folder. The template we are using in this course is using the flat-layout, but there are pros and cons for both.
After having created your new project, the first step is to also create a corresponding virtual environment and install any needed requirements. If you have a virtual environment from yesterday feel free to use that else create a new. Then install the project in that environment
pip install -e .\n
Start by filling out the <project_name>/data/make_dataset.py
file. When this file runs, it should take the raw data e.g. the corrupted MNIST files from yesterday (../data/corruptmnist
) which now should be located in a data/raw
folder and process them into a single tensor, normalize the tensor and save this intermediate representation to the data/processed
folder. By normalization here we refer to making sure the images have mean 0 and standard deviation 1.
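A minimal sketch of what make_dataset.py could do; the file names and the number of raw files are assumptions and depend on how the raw corrupted MNIST data is stored on your machine:
import torch\n\ndef make_dataset() -> None:\n    # file names and the number of raw files are assumptions, adjust to your setup\n    images = torch.cat([torch.load(f'data/raw/train_images_{i}.pt') for i in range(5)])\n    targets = torch.cat([torch.load(f'data/raw/train_target_{i}.pt') for i in range(5)])\n    images = (images - images.mean()) / images.std()  # normalize to mean 0 and std 1\n    torch.save(images, 'data/processed/train_images.pt')\n    torch.save(targets, 'data/processed/train_target.pt')\n\nif __name__ == '__main__':\n    make_dataset()\n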
This template comes with a Makefile
that can be used to easily define common operations in a project. You do not have to understand the complete file but try taking a look at it. In particular the following commands may come in handy
make data # runs the make_dataset.py file, try it!\nmake clean # clean __pycache__ files\nmake requirements # install everything in the requirements.txt file\n
Windows users make
is a GNU build tool that is by default not available on Windows. There are two recommended ways to get it running on Windows. The first is leveraging the Windows Subsystem for Linux, which you maybe already have installed. The second option is utilizing the chocolatey package manager, which enables Windows users to install packages similarly to a Linux system.
In general we recommend that you add commands to the Makefile
as you move along in the course. If you want to know more about how to write Makefile
s then this is an excellent video.
Put your model file (model.py
) into <project_name>/models
folder together and insert the relevant code from the main.py
file into the train_model.py
file. Make sure that whenever a model is trained and it is saved, that it gets saved to the models
folder (preferably in sub-folders).
When you run train_model.py
, make sure that some statistics/visualizations from the trained models gets saved to the reports/figures/
folder. This could be a simple .png
of the training curve.
(Optional) Can you figure out a way to add a train
command to the Makefile
such that training can be started using
make train\n
Fill out the newly created <project_name>/models/predict_model.py
file, such that it takes a pre-trained model file and creates prediction for some data. Recommended interface is that users can give this file either a folder with raw images that gets loaded in or a numpy
or pickle
file with already loaded images e.g. something like this
python <project_name>/models/predict_model.py \\\n models/my_trained_model.pt \\ # file containing a pretrained model\n data/example_images.npy # file containing just 10 images for prediction\n
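A minimal sketch of the prediction logic inside predict_model.py, where MyModel is again the hypothetical model class from your own model.py and the command line handling is omitted:
import numpy as np\nimport torch\nfrom model import MyModel  # hypothetical class from your own model.py\n\ndef predict(model_checkpoint: str, data_path: str) -> torch.Tensor:\n    model = MyModel()\n    model.load_state_dict(torch.load(model_checkpoint))\n    model.eval()\n    images = torch.from_numpy(np.load(data_path)).float()\n    with torch.no_grad():\n        return model(images).argmax(dim=-1)  # predicted class per image\n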
Fill out the file <project_name>/visualization/visualize.py
with this (as minimum, feel free to add more visualizations)
reports/figures/
folder. (Optional) Feel free to create more files/visualizations (what about investigating/exploring the data distribution?)
Make sure to update the README.md
file with a short description on how your scripts should be run
Finally make sure to update the requirements.txt
file with any packages that are necessary for running your code (see this set of exercises for help)
(Optional) Let's say that you are not satisfied with the template I have recommended that you use, which is completely fine. What should you then do? You should of course create your own template! This is actually not that hard to do.
Just for a starting point I would recommend that you fork either the mlops template which you have already been using or alternatively fork the data science template template.
After forking the template, clone it down locally and lets start modifying it. The first step is changing the cookiecutter.json
file. For the mlops template it looks like this:
{\n \"project_name\": \"project_name\",\n \"repo_name\": \"{{ cookiecutter.project_name.lower().replace(' ', '_') }}\",\n \"author_name\": \"Your name (or your organization/company/team)\",\n \"description\": \"A short description of the project.\",\n \"python_version_number\": \"3.10\",\n \"open_source_license\": [\"No license file\", \"MIT\", \"BSD-3-Clause\"]\n}\n
simply add a new line to the json file with the name of the variable you want to add and the default value you want it to have.
The actual template is located in the {{ cookiecutter.project_name }}
folder. cookiecutter
works by replacing everywhere that it sees {{ cookiecutter.<variable_name> }}
with the value of the variable. Therefore, if you want to add a new file to the template, just add it to the {{ cookiecutter.project_name }}
folder and make sure to add the {{ cookiecutter.<variable_name> }}
where you want the variable to be replaced.
After you have made the changes you want to the template, you should test it locally. Just run
cookiecutter . -f --no-input\n
and it should create a new folder using the default values of the cookiecutter.json
file.
Finally, make sure to push any changes you made to the template to GitHub, such that you in the future can use it by simply running
cookiecutter https://github.com/<username>/<my_template_repo>\n
Starting from complete scratch, what are the steps needed to create a new GitHub repository and push a specific template to it as the very first commit?
SolutionCreate a completely barebone repository, either using the GitHub UI or if you have the github cli installed (not git
) you can run
gh repo create <repo_name> --public --confirm\n
Run cookiecutter
with the template you want to use
cookiecutter <template>\n
The name of the folder created by cookiecutter
should be the same as you just used.
Run the following sequence of commands
cd <project_name>\ngit init\ngit add .\ngit commit -m \"Initial commit\"\ngit remote add origin https://github.com/<username>/<repo_name>\ngit push origin master\n
That's it. The template should now have been pushed to the repository as the first commit.
That ends the module on code structure and cookiecutter
. We again want to stress the point of using cookiecutter
is not about following one specific template, but instead just to use any template for organizing your code. What often happens in a team is that multiple templates are needed in different stages of the development phase or for different product types because they share common structure, while still having some specifics. Keeping templates up-to-date then becomes critical such that no team member is using an outdated template. If you ever end up in this situation, we highly recommend to checkout cruft that works alongside cookiecutter
to not only make projects but update existing ones as template evolves. Cruft additionally also has template validation capabilities to ensure projects match the latest version of a template.
Core Module
In this module, we are going to return to version control. However, this time we are going to focus on version control of data. The reason we need to separate between standard version control and data version control comes down to one problem: size.
Classic version control was developed to keep track of code files, which all are simple text files. Even a codebase that contains 1000+ files with millions of lines of code can probably be stored in less than a single gigabyte (GB). On the other hand, the size of data can be drastically bigger. As most machine learning algorithms only get better with the more data that you feed them, we are seeing models today that are being trained on petabytes of data (1.000.000 GB).
Because this is an important concept there exist a couple of frameworks that have specialized in versioning data such as DVC, DAGsHub, Hub, Modelstore and ModelDB. Regardless of the framework, they all implement somewhat the same concept: instead of storing the actual data files, or in general any large artifact files, we store a pointer to these large files. We then version control the pointer instead of the artifact.
Image creditWe are in this course going to use DVC
provided by iterative.ai as they also provide tools for automatizing machine learning, which we are going to focus on later.
DVC (Data Version Control) is simply an extension of git
to not only version data but also models and experiments in general. But how does it deal with these large data files? Essentially, DVC
will just keep track of a small metafile that will then point to some remote location where your original data is stored. Metafiles essentially work as placeholders for your data files. Your large data files are then stored in some remote location such as Google Drive or an S3
bucket from Amazon.
Image credit
As the figure shows, we now have two remote locations: one for code and one for data. We use git pull/push
for the code and dvc pull/push
for the data. The key concept is the connection between the data file model.pkl
which is fairly large and its respective metafile model.pkl.dvc
which is very small. The large file is stored in the data remote and the metafile is stored in the code remote.
If in doubt about some of the exercises, we recommend checking out the documentation for DVC as it contains excellent tutorials.
For these exercises, we are going to use Google drive as a remote storage solution for our data. If you do not already have a Google account, please create one (we are going to use it again in later exercises). Please make sure that you at least have 1GB of free space.
Next, install DVC and the Google Drive extension
pip install dvc\npip install \"dvc[gdrive]\"\n
If you installed DVC via pip and plan to use cloud services as remote storage, you might need to install these optional dependencies: [s3], [azure], [gdrive], [gs], [oss], [ssh]. Alternatively, use [all] to include them all. If you encounter that the installation fails, we recommend that you start by updating pip and then trying to update dvc
:
pip install -U pip\npip install -U \"dvc[gdrive]\"\n
If this does not work for you, it is most likely due to a problem with pygit2
and in that case we recommend that you follow the instructions here.
In your MNIST repository run the following command from the terminal
dvc init\n
this will setup dvc
for this repository (similar to how git init
will initialize a git repository). These files should be committed using standard git
to your repository.
Go to your Google Drive and create a new folder called dtu_mlops_data
. Then copy the unique identifier belonging to that folder as shown in the figure below
Using this identifier, add it as a remote storage
dvc remote add -d storage gdrive://<your_identifier>\n
Check the content of the file .dvc/config
. Does it contain a pointer to your remote storage? Afterwards, make sure to add this file to the next commit we are going to make:
git add .dvc/config\n
Call the dvc add
command on your data files exactly like you would add a file with git
(you do not need to add every file by itself as you can directly add the data/
folder). Doing this should create a human-readable file with the extension .dvc
. This is the metafile as explained earlier that will serve as a placeholder for your data. If you are on Windows and this step fails you may need to install pywin32
. At the same time, the data
folder should have been added to the .gitignore
file that marks which files should not be tracked by git. Confirm that this is correct.
Now we are going to add, commit and tag the metafiles so we can restore to this stage later on. Commit and tag the files, which should look something like this:
git add data.dvc .gitignore\ngit commit -m \"First datasets, containing 25000 images\"\ngit tag -a \"v1.0\" -m \"data v1.0\"\n
Finally, push your data to the remote storage using dvc push
. You will be asked to authenticate, which involves copy-pasting the code in the link prompted. Check out your Google Drive folder. You will see that the data is not in a recognizable format anymore due to the way that dvc
packs and tracks the data. The boring detail is that dvc
converts the data into content-addressable storage which makes data much faster to get. Finally, make sure that your data is not stored in your Github repository.
After authenticating the first time, DVC
should be setup without having to authenticate again. If you for some reason encounter that dvc fails to authenticate, you can try to reset the authentication. Locate the file $CACHE_HOME/pydrive2fs/{gdrive_client_id}/default.json
where $CACHE_HOME
depends on your operating system:
~/Library/Caches
~/.cache
This is the typical location, but it may vary depending on what distro you are running
{user}/AppData/Local
Delete the complete {gdrive_client_id}
folder and retry authenticating with dvc push
.
After completing the above steps, it is very easy for others (or yourself) to get setup with both code and data by simply running
git clone <my_repository>\ncd <my_repository>\ndvc pull\n
(assuming that you give them access rights to the folder in your drive). Try doing this (in some other location than your standard code) to make sure that the two commands indeed download both your code and data.
Let's look at the process of updating our data. Remember, the important aspect of version control is that we do not need to store explicit files called data_v1.pt
, data_v2.pt
etc. but just have a single data.pt
where we can always check out earlier versions. Initially, start by copying the data/corruptmnist_v2
folder from this repository to your MNIST code. This contains 3 extra datafiles with 15000 additional observations. Rerun your data pipeline so these gets incorporated into the files in your processed
folder.
Redo the above steps, adding the new data using dvc
, committing and tagging the metafiles e.g. the following commands should be executed (with appropriate input):
dvc add -> git add -> git commit -> git tag -> dvc push -> git push
.
Lets say that you wanted to go back to the state of your data in v1.0. If the above steps have been done correctly, you should be able to do this using:
git checkout v1.0\ndvc checkout\n
confirm that you have reverted to the original data.
(Optional) Finally, it is important to note that dvc
is not only intended to be used to store data files but also any other large files such as trained model weights (with billions of parameters these can be quite large). For example, if we always store our best-performing model in a file called best_model.ckpt
then we can use dvc
to version control it, store it online and make it easy for others to download. Feel free to experiment with this using your model checkpoints.
In general dvc
is a great framework for version-controlling data and models. However, it is important to note that it does have some performance issue when dealing with datasets that consist of many files. Therefore, if you are ever working with a dataset that consists of many small files, it can be a good idea to:
zip files into a single archive and then version control the archive. The zip
archive should be placed in a data/raw
folder and then unzipped in the data/processed
folder.
If possible turn your data into 1D arrays, then it can be stored in a single file such as .parquet
or .csv
. This is especially useful for tabular data. Then you can version control the single file instead of the many files.
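A small sketch of that idea, assuming the images are a tensor of shape [N, 28, 28] and that pandas with a parquet engine such as pyarrow is installed:
import pandas as pd\nimport torch\n\nimages = torch.load('data/processed/train_images.pt')  # example path\ndf = pd.DataFrame(images.flatten(start_dim=1).numpy())  # one row per image, one column per pixel\ndf.to_parquet('data/processed/train_images.parquet')  # a single file that dvc can track\n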
How do you know that a repository is using dvc?
SolutionSimilar to a git repository having a .git
directory, a repository using dvc needs to have a .dvc
folder. Alternatively you can use the dvc status
command.
Assume you just added a folder called data/
that you want to track with dvc
. What is the sequence of 5 commands to successful version control the folder? (assuming you already setup a remote)
dvc add data/\ngit add .\ngit commit -m \"added raw data\"\ngit push\ndvc push\n
That's all for today. With the combined power of git
and dvc
we should be able to version control everything in our development pipeline such that no changes are lost (assuming we commit regularly). It should be noted that dvc
offers more than just data version control, so if you want to deep dive into dvc
we recommend their pipeline feature and how this can be used to setup version controlled experiments. Note that we are going to revisit dvc
later for a more permanent (and large-scale) storage solution.
Core Module
Proper collaboration with other people will require that you can work on the same codebase in an organized manner. This is the reason that version control exists. Simply stated, it is a way to keep track of:
For a full explanation please see this page
Secondly, it is important to note that GitHub is not git! GitHub is the dominating player when it comes to hosting repositories but that does not mean that they are the only one providing free repository hosting (see bitbucket or gitlab for some other examples).
That said we will be using git and GitHub throughout this course. It is a requirement for passing this course that you create a public repository with your code and use git to upload any code changes. How much you choose to integrate this into your own projects depends, but you are at least expected to be familiar with git+GitHub.
Image credit"},{"location":"s2_organisation_and_version_control/git/#initial-config","title":"Initial config","text":"What does Git stand for?
The name \"git\" was given by Linus Torvalds when he wrote the very first version. He described the tool as \"the stupid content tracker\" and the name as (depending on your mood):
Install git on your computer and make sure that your installation is working by writing git help
in a terminal and it should show you the help message for git.
Create a GitHub account if you do not already have one.
To make sure that we do not have to type in our GitHub username every time that we want to do some changes, we can once and for all set them on our local machine
# type in a terminal\ngit config credential.helper store\ngit config --global user.email <email>\n
The most simple way to think of version control, is that it is just nodes with lines connecting them
Each node, which we call a commit is uniquely identified by a hash string. Each node, stores what our code looked like at that point in time (when we made the commit) and using the hash codes we can easily revert to a specific point in time.
The commits are made up of local changes that we make to our code. A basic workflow for adding commits are seen below
Assuming that we have made some changes to our local working directory and that we want to get these updates to be online in the remote repository we have to do the following steps:
First we run the command git add
. This will move our changes to the staging area. While changes are in the staging area we can very easily revert them (using git restore
). There have therefore not been assigned a unique hash to the code yet, and we can therefore still overwrite it.
To take our code from the staging area and make it into a commit, we simply run git commit
which will locally add a note to the graph. It is important again, that we have not pushed the commit to the online repository yet.
Finally, we want others to be able to use the changes that we made. We do a simple git push
and our commit gets online
Of course, the real power of version control is the ability to make branches, as in the image below
Image creditEach branch can contain code that are not present on other branches. This is useful when you are many developers working together on the same project.
"},{"location":"s2_organisation_and_version_control/git/#exercises","title":"\u2754 Exercises","text":"In your GitHub account create an repository, where the intention is that you upload the code from the final exercise from yesterday
After creating the repository, clone it to your computer
git clone https://github.com/my_user_name/my_repository_name.git\n
Move/copy the three files from yesterday into the repository (and any other that you made)
Add the files to a commit by using git add
command (1)
Commit the files using git commit
Finally push the files to your repository using git push
. Make sure to check online that the files have been updated in your repository.
You can always use the command git status
to check where you are in the process of making a commit.
Also checkout the git log
command, which will show you the history of commits that you have made.
Make sure that you understand how to make branches, as this will allow you to try out code changes without messing with your working code. Creating a new branch can be done using:
# create a new branch\ngit checkout -b <my_branch_name>\n
Afterwards, you can use git checkout
to change between branches (remember to commit your work!) Try adding something (a file, a new line of code etc.) to the newly created branch, commit it and try changing back to master afterwards. You should hopefully see whatever you added on the branch is not present on the main branch.
If you do not already have a cloned version of this repository belonging to the course, make sure to make one! I am continuously updating/changing some of the material during the course and I therefore recommend that you each day before the lecture do a git pull
on your local copy
Git may seem like a waste of time when solutions like Dropbox, Google Drive etc. exist, and that is not completely untrue when only one or two people are working on a project. However, these file management systems fall short when hundreds to thousands of people work together. For this exercise you will go through the steps of sending an open-source contribution:
Go online and find a project you do not own, where you can improve the code. You can either look at this page of good issues to get started with or for simplicity you can just choose the repository belonging to the course. Now fork the project by clicking the Fork button.
This will create a local copy of the repository which you have complete writing access to. Note that code updates to the original repository do not update code in your local repository.
Clone your local fork of the project using git clone
.
By default your local repository will be on the main branch
(HINT: you can check this with the git status
command). It is good practice to make a new branch when working on some changes. Use the git branch
command followed by the git checkout
command to create a new branch.
You are now ready to make changes to the repository. Try to find something to improve (any spelling mistakes?). When you have made the changes, do the standard git cycle: add -> commit -> push
Go online to the original repository and go to the Pull requests
tab. Find the compare
button and choose the button to compare the master branch
of the original repo with the branch that you just created in your own repository. Check the diff on the page to make sure that it contains the changes you have made.
Write a bit about the changes you have made and click Create pull request
:)
Forking a repository has the consequence that your fork and the repository that you forked can diverge. To mitigate this we can set what is called a remote upstream. Take a look at this page, and set a remote upstream for the repository you just forked.
After setting the upstream branch, we need to pull and merge any updates. Take a look at this page and figure out how to do this.
As a final exercise we want to simulate a merge conflict, which happens when two users try to commit changes to exactly the same lines of code in the codebase, and git is not able to resolve how the different commits should be integrated.
In your browser, open your favorite repository (it could be the one you just worked on), go to any file of your choosing and click the edit button (see image below) and make some change to the file. For example, if you choose a python file you can just import some random packages at the top of the file. Commit the change.
Make sure not to pull the change you just made to your local computer. Locally make changes to the same file in the same lines and commit them afterwards.
Now try to git pull
the online changes. What should (hopefully) happen is that git will tell you that it found a merge conflict that needs to be resolved. Open the file and you should see something like this
<<<<<<< HEAD\nthis is some content to mess with\ncontent to append\n=======\ntotally different content to merge later\n>>>>>>> master\n
this should be interpreted as: everything between <<<<<<<
and =======
are the changes made by your local commit and everything between =======
and >>>>>>>
are the changes you are trying to pull. To fix the merge conflict you simply have to make the code in the two \"cells\" work together. When you are done, remove the identifiers <<<<<<<
, =======
and >>>>>>>
.
Finally, commit the merge and try to push.
(Optional) The above exercises have focused on how to use git from the terminal, which I highly recommend learning. However, if you are using a proper editor it most likely also has built-in support for version control. We recommend getting familiar with these features (here is a tutorial for VS Code)
How do you know if a certain directory is a git repository?
SolutionYou can check if there is a \".git\" directory. Alternatively you can use the git status
command.
Explain what the file gitignore
is used for?
The file gitignore
is used to tell git which files to ignore when doing a git add .
command. This is useful for files that are not part of the codebase, but are needed for the code to run (e.g. data files) or files that contain sensitive information (e.g. .env
files that contain API keys and passwords).
You have two branches - main and devel. What sequence of commands would you need to execute to make sure that devel is in sync with main?
Solutiongit checkout main\ngit pull\ngit checkout devel\ngit merge main\n
What best practices are you familiar with regarding version control?
SolutionThat covers the basics of git to get you started. In the exercise folder you can find a git cheat sheet with the most useful commands for future reference. Finally, we want to point out another awesome feature of GitHub: the in-browser editor. Sometimes you have a small edit that you want to make, but would still like to do this in an IDE/editor. Or you may be in the situation where you are working from another device than your usual developer machine. GitHub has a built-in editor that can simply be enabled by changing any URL from
https://github.com/username/repository\n
to
https://github.dev/username/repository\n
Try it out on your newly created repository.
"},{"location":"s2_organisation_and_version_control/good_coding_practice/","title":"M7 - Good coding practice","text":""},{"location":"s2_organisation_and_version_control/good_coding_practice/#good-coding-practice","title":"Good coding practice","text":"Quote
Code is read more often than it is written. Guido Van Rossum (author of Python)
It is hard to define exactly what good coding practices are. But the above quote by Guido does hint at what it could be, namely that it has to do with how others observe and perceive your code. In general, good coding practice is about making sure that your code is easy to read and understand, not only by others but also by your future self. The key concept to keep in mind when we are talking about good coding practice is consistency. In many cases it does not matter exactly how you choose to style your code etc., the important part is that you are consistent about it.
Image credit"},{"location":"s2_organisation_and_version_control/good_coding_practice/#documentation","title":"Documentation","text":"Most programmers have a love-hate relationship with documentation: We absolute hate writing it ourself, but love when someone else has actually taken time to add it to their code. There is no doubt about that well documented code is much easier to maintain, as you do not need to remember all details about the code to still maintain it. It is key to remember that good documentation saves more time, than it takes to write.
The problem with documentation is that there is no right or wrong way to do it. You can end up doing:
Under documentation: You document information that is clearly visible from the code and not the complex parts that are actually hard to understand.
Over documentation: Writing too much documentation will have the opposite effect on most people than what you want: there is too much to read, so people will skip it.
Writing good documentation is a skill that takes time to train, so let's try to do it.
Quote
Code tells you how; Comments tell you why. Jeff Atwood
"},{"location":"s2_organisation_and_version_control/good_coding_practice/#exercises","title":"\u2754 Exercises","text":"Go over the most complicated file in your project. Be critical and add comments where the logic behind the code is not easily understandable. (1)
In deep learning we often work with multi-dimensional tensors that constantly change shape after each operation. It is good practice to annotate with comments when tensors undergo some reshaping. In the following example we compute the pairwise Euclidean distance between two tensors using broadcasting, which results in multiple shape operations.
x = torch.randn(5, 10) # N x D\ny = torch.randn(7, 10) # M x D\nxy = x.unsqueeze(1) - y.unsqueeze(0) # (N x 1 x D) - (1 x M x D) = (N x M x D)\npairwise_euc_dist = xy.abs().pow(2.0).sum(dim=-1) # N x M\n
Add docstrings to at least two python functions/methods. You can see here (example 5) a good example of how to use identifiable keywords such as Parameters
, Args
, Returns
which standardizes the way of writing docstrings.
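As a minimal sketch (using a made-up function, not one from your project), a docstring with such keywords could look like this:
def normalize(values: list[float], eps: float = 1e-8) -> list[float]:\n    \"\"\"Normalize a list of values to the range [0, 1].\n\n    Args:\n        values: the values to normalize.\n        eps: small constant to avoid division by zero (hypothetical default).\n\n    Returns:\n        The normalized values.\n    \"\"\"\n    low, high = min(values), max(values)\n    return [(v - low) / (high - low + eps) for v in values]\n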
While python already enforces some styling (e.g. code should be indented in a specific way), this is not enough to ensure that code from different users actually looks alike. Maybe even more troubling is that you will often see your own style of coding change as you become more and more experienced. This kind of difference in coding style is not that important to take care of when you are working on a personal project, but when multiple people are working together on the same project it is important to consider.
The question then remains what styling you should use. This is where PEP8 comes into play, which is the official style guide for python. It essentially contains what is considered \"good practice\" and \"bad practice\" when coding python.
For many years the most commonly used tool to check if your code is PEP8 compliant was flake8. However, in this course we are going to be using ruff, which is quickly gaining popularity due to how fast it is and how quickly the developers are adding new features. (1)
flake8
and ruff
are what are called linters or lint tools, which are static code analysis programs used to flag programming errors, bugs, and styling errors.Install ruff
pip install ruff\n
Run ruff
on your project or part of your project
ruff check . # Lint all files in the current directory (and any subdirectories)\nruff check path/to/code/ # Lint all files in `/path/to/code` (and any subdirectories).\n
are you PEP8 compliant or are you a normal mortal?
You could go and fix all the small errors that ruff
is giving. However, in practice large projects instead rely on some kind of code formatter that will automatically format your code to be PEP8 compliant. For the longest time some of the biggest formatters in Python have been black and yapf, but we are going to use ruff
which also has a built-in formatter that should be a drop-in replacement for black
.
Try to use ruff format
to format your code
ruff format . # Format all files in the current directory.\nruff format /path/to/file.py # Format a single file.\n
By default ruff
will apply a selection of rules when we are either checking it or formatting it. However, many more rules can be activated through configuration. If you have completed module M6 on code structure you will have encountered the pyproject.toml
file, which can store both build instructions about our package but also configuration of developer tools. Lets try to configure ruff
using the pyproject.toml
file.
One aspect that is not covered by PEP8 is how import
statements in Python should be organized. If you are like most people, you place your import
statements at the top of the file and they are ordered simply by when you needed them. A better practice is to introduce some clear structure in our imports. In older versions of this course we used isort for this, but here we are going to configure ruff
to do the job. In your pyproject.toml
file add the following lines
[tool.ruff]\nselect = [\"I\"]\n
and try re-running ruff check
and ruff format
. Hopefully this should reorganize your imports to follow common practice. (1)
os
) in one block, followed by third-party dependencies (like torch
) in a second block and finally imports from your own package in a third block. Each block is then put in alphabetical order.One PEP8 styling rule that is often deviated from is the recommended line length of 79 characters, which by many (including myself) is considered very restrictive. If your code consists of multiple levels of indentation, you can quickly run into 79 characters being limiting. For this reason many projects increase it, often to 120 characters, which seems to be the sweet spot for how many characters fit in a coding window on a laptop. Add the line
line-length=120\n
under the [tool.ruff]
section in the pyproject.toml
file and rerun ruff check
and ruff format
on your code.
Experiment yourself with further configuration of ruff
. In particular we recommend adding more rules and looking [tool.ruff.pydocstyle]
configuration to indicate how you have styled your documentation.
In addition to writing documentation and following a specific styling, in python we have a third way of improving the quality of our code: through typing. Typing goes back to the earlier programming languages like c
, c++
etc. where data types needed to be explicitly stated for variables:
#include <iostream>\n\nint main() {\n    int x = 5 + 6;\n    float y = 0.5;\n    std::cout << \"Hello World! \" << x << \" \" << y << std::endl;\n    return 0;\n}\n
This is not required by python but it can really improve the readability of code: you can directly read from the code what the expected types of input arguments and return values are. In python the :
character has been reserved for type hints. Here is one example of adding typing to a function:
def add2(x: int, y: int) -> int:\n return x+y\n
here we mark that both x
and y
are integers and using the arrow notation ->
we mark that the output type is also an integer. Assuming that we are also going to use the function for floats and torch.Tensor
s we could improve the typing by specifying a union of types. Depending on the version of python you are using the syntax for this can be different.
from torch import Tensor # note it is Tensor with upper case T. This is the base class of all tensors\nfrom typing import Union\ndef add2(x: Union[int, float, Tensor], y: Union[int, float, Tensor]) -> Union[int, float, Tensor]:\n return x+y\n
from torch import Tensor # note it is Tensor with upper case T. This is the base class of all tensors\ndef add2(x: int | float | Tensor, y: int | float | Tensor) -> int | float | Tensor:\n return x+y\n
Finally, since this is a very generic function it also works on numpy
arrays etc. we can always default to the Any
type if we are not sure about all the specific types that a function can take
from typing import Any\ndef add2(x: Any, y: Any) -> Any:\n return x+y\n
However, in this case we are basically in the same situation as if our function were not typed, as the type hints do not help us at all. Therefore, use Any
only when necessary.
Exercise files
We provide a file called typing_exercise.py
. Add typing everywhere in the file. Please note that you will need the following import:
from typing import Callable, Optional, Tuple, Union, List # you will need all of them in your code\n
for it to work. This cheat sheet is a good resource on typing. We also provide typing_exercise_solution.py
, but try to solve the exercise yourself.
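To get a feel for these imports, here is a small hypothetical example (not taken from the exercise file) that uses Callable, Optional and List:
from typing import Callable, List, Optional\n\ndef apply_twice(fn: Callable[[int], int], values: List[int], limit: Optional[int] = None) -> List[int]:\n    \"\"\"Apply fn two times to each value, optionally keeping only the first limit results.\"\"\"\n    out = [fn(fn(v)) for v in values]\n    return out if limit is None else out[:limit]\n\nprint(apply_twice(lambda x: x + 1, [1, 2, 3], limit=2))  # prints [3, 4]\n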
mypy is what is called a static type checker. If you are using typing in your code, then a static type checker can help you find common mistakes. mypy
does not run your code, but it scans it and checks that the types you have given are compatible. Install mypy
pip install mypy\n
Try to run mypy
on the typing_exercise.py
file
mypy typing_exercise.py\n
If you have solved exercise 11 correctly then you should get no errors. If not mypy
should tell you where your types are incompatible.
According to PEP8 what is wrong with the following code?
class myclass(nn.Module):\n def TrainNetwork(self, X, y):\n ...\n
Solution According to PEP8 classes should follow the CapWords convention, meaning that the first letter in each word of the class name should be capitalized. Thus myclass
should therefore be MyClass
. On the other hand, functions and methods should be fully lowercase with words separated by underscores. Thus TrainNetwork
should be train_network
.
What would be the type of argument x
for a function def f(x):
if it should support the following input
x1 = [1, 2, 3, 4]\nx2 = (1, 2, 3, 4)\nx3 = None\nx4 = {1: \"1\", 2: \"2\", 3: \"3\", 4: \"4\"}\n
Solution The easy solution would be to do def f(x : Any)
. But instead we could also go with:
from typing import Dict, List, Tuple\ndef f(x: None | Tuple[int, ...] | List[int] | Dict[int, str]):\n
alternatively, we could also do
from typing import Iterable\ndef f(x: None | Iterable[int]):\n
because both list
, tuple
and dict
are iterables and therefore can be covered by one type (in this specific case).
This ends the module on coding style. We again want to emphasize that a good coding style is more about having a consistent style than strictly following PEP8. A good example of this is Google, which has its own style guide for Python. This guide does not match PEP8 exactly, but it makes sure that different teams within Google that are working on different projects still follow the same style to a large degree, so if a project is handed from one team to another then at least that will not be a problem.
"},{"location":"s3_reproducibility/","title":"Reproducibility","text":"Slides
Today is all about reproducibility - one of those concepts that everyone agrees is very important and something should be done about, but the reality is that it is very hard to secure full reproducibility. The last sessions have already touched a bit on how tools like conda
and code organization can help make code more reproducible. Today we are going all the way to ensure that our scripts and our computing environment are fully reproducible.
Reproducibility is closely related to the scientific method:
Observe -> Question -> Hypotheses -> Experiment -> Conclude -> Result -> Observe -> ...
Not having reproducible experiments essentially breaks the cycle between doing experiments and making conclusions. If experiments are not reproducible, then we do not expect that others will arrive at the same conclusion as ourselves. As machine learning experiments are fundamentally the same as doing chemical experiments in a laboratory, we should be equally careful in making sure our environments are reproducible (think of your laptop as your laboratory).
Secondly, if we focus on why reproducibility matters especially in machine learning, it is part of the bigger challenge of making sure that machine learning is trustworthy.
Many different aspects are needed if trustworthy machine learning is ever going to be a reality. We need robustness of our pipelines so we can trust that they do not fail under heavy load. We need integrity to make sure that pipelines are deployed if they are of high quality. We need explainability to make sure that we understand what our machine learning models are doing, so it is not just a black box. We need reproducibility to make sure that the results of our models can be reproduced over and over again. Finally, we need fairness to make sure that our models are not biased toward specific populations. Figure inspired by this paper.Trustworthy ML is the idea that machine learning agents can be trusted. Take the example of a machine learning agent being responsible for medical diagnoses. It is s very clear that we need to be able to trust that the agent gives us the correct diagnosis for the system to work in practice. Reproducibility plays a big role here, because without we cannot be sure that the same agent deployed at two different hospitals will give the same diagnosis (given the same input).
Learning objectives
The learning objectives of this session are:
docker
to create a reproducible container, including how to build them from scratchhydra
to integrate with config filesWith docker we can make sure that our compute environment is reproducible, but that does not mean that all our experiments magically become reproducible. There are other factors that are important for creating reproducible experiments.
In this paper (highly recommended read) the authors tried to reproduce the results of 255 papers and tried to figure out which factors where significant to succeed. One of those factors were \"Hyperparameters Specified\" e.g. whether or not the authors of the paper had precisely specified the hyperparameter that was used to run the experiments. It should come as no surprise that this can be a determining factor for reproducibility, however it is not given that hyperparameters are always well specified.
"},{"location":"s3_reproducibility/config_files/#configure-experiments","title":"Configure experiments","text":"There is really no way around it: deep learning contains a lot of hyperparameters. In general, a hyperparameter is any parameter that affects the learning process (e.g. the weights of a neural network are not hyperparameters because they are a consequence of the learning process). The problem with having many hyperparameters to control in your code, is that if you are not careful and structure them it may be hard after running a experiment to figure out which hyperparameters were actually used. Lack of proper configuration management can cause serious problems with reliability, uptime, and the ability to scale a system.
One of the most basic ways of structuring hyperparameters, is just to put them directly into you train.py
script in some object:
class my_hp:\n batch_size: 64\n lr: 128\n other_hp: 12345\n\n# easy access to them\ndl = DataLoader(Dataset, batch_size=my_hp.batch_size)\n
the problem here is configuration is not easy. Each time you want to run a new experiment, you basically have to change the script. If you run the code multiple times, without committing the changes in between then the exact hyperparameter configuration for some experiments may be lost. Alright, with this in mind you change strategy to use an argument parser e.g. run experiments like this
python train.py --batch_size 256 --learning_rate 1e-4 --other_hp 12345\n
This at least solves the problem with configurability. However, we again can end up with losing experiments if we are not careful.
What we really want is some way to easily configure our experiments where the hyperparameters are systematically saved with the experiment. For this we turn our attention to Hydra, a configuration tool that is based around writing config files to keep track of hyperparameters. Hydra operates on top of OmegaConf which is a yaml
based hierarchical configuration system.
A simple yaml
configuration file could look like
#config.yaml\nhyperparameters:\n batch_size: 64\n learning_rate: 1e-4\n
with the corresponding python code for loading the file
from omegaconf import OmegaConf\n# loading\nconfig = OmegaConf.load('config.yaml')\n\n# accessing in two different ways\ndl = DataLoader(dataset, batch_size=config.hyperparameters.batch_size)\noptimizer = torch.optim.Adam(model.parameters(), lr=config['hyperparameters']['lr'])\n
or using hydra
for loading the configuration
import hydra\n\n@hydra.main(config_name=\"basic.yaml\")\ndef main(cfg):\n print(cfg.hyperparameters.batch_size, cfg.hyperparameters.learning_rate)\n\nif __name__ == \"__main__\":\n main()\n
The idea behind refactoring our hyperparameters into .yaml
files is that we disentangle the model configuration from the model. In this way it is easier to do version control of the configuration because we have it in a separate file.
Exercise files
The main idea behind the exercises is to take a single script (that we provide) and use Hydra to make sure that everything gets correctly logged such that you would be able to exactly report to others how each experiment was configured. In the provided script, the hyperparameters are hardcoded into the code and your job will be to separate them out into a configuration file.
Note that we provide a solution (in the vae_solution
folder) that can help you get through the exercise, but try to look online for your answers before looking at the solution. Remember: its not about the result, its about the journey.
Start by install hydra: pip install hydra-core --upgrade
Next take a look at the vae_mnist.py
and model.py
file and understand what is going on. It is a model we will revisit during the course.
Identify the key hyperparameters of the script. Some of them should be easy to find, but at least 3 have made it into the core part of the code. One essential hyperparameter is also not included in the script but is needed to be completely reproducible (HINT: the weights of any neural network are initialized at random).
Write a configuration file config.yaml
where you write down the hyperparameters that you have found
Get the script running by loading the configuration file inside your script (using hydra) that incorporates the hyperparameters into the script. Note: you should only edit the vae_mnist.py
file and not the model.py
file.
Run the script
By default hydra will write the results to a outputs
folder, with a sub-folder for the day the experiment was run and further the time it was started. Inspect your run by going over each file the hydra has generated and check the information has been logged. Can you find the hyperparameters?
Hydra also allows for dynamically changing and adding parameters on the fly from the command-line:
Try changing one parameter from the command-line
python vae_mnist.py hyperparameters.seed=1234\n
Try adding one parameter from the command-line
python vae_mnist.py +experiment.stuff_that_i_want_to_add=42\n
By default the file vae_mnist.log
should be empty, meaning that whatever you printed to the terminal did not get picked up by Hydra. This is due to Hydra under the hood making use of the native python logging package. This means that to also save all printed output from the script we need to convert all calls to print
with log.info
Create a logger in the script:
import logging\nlog = logging.getLogger(__name__)\n
Exchange all calls to print
with calls to log.info
Try re-running the script and make sure that the output printed to the terminal also gets saved to the vae_mnist.log
file
Make sure that your script is fully reproducible. To check this you will need two runs of the script to compare. Then run the reproducibility_tester.py
script as
python reproducibility_tester.py path/to/run/1 path/to/run/2\n
the script will go over the trained weights to see if they match and that the hyperparameters were the same. Note: for the script to work, the weights should be saved to a file called trained_model.pt
(this is the default of the vae_mnist.py
script, so only relevant if you have changed the saving of the weights)
Finally, make a new experiment using a new configuration file where you have changed a hyperparameter of your own choice. You are not allowed to change the configuration file in the script but should instead be able to provide it as an argument when launching the script e.g. something like
python vae_mnist.py experiment=exp2\n
We recommend that you use a file structure like this
|--conf\n| |--config.yaml\n| |--experiments\n| |--exp1.yaml\n| |--exp2.yaml\n|--my_app.py\n
Make your MNIST code reproducible! Apply what you have just done for the simple script to your MNIST code. The only requirement is that you this time use multiple configuration files, meaning that you should have at least one model_conf.yaml
file and a training_conf.yaml
file that separates out the hyperparameters that have to do with the model definition and those that have to do with the training. You can also choose to work with even more complex config setups: in the image below the configuration has two layers such that we individually can specify hyperparameters belonging to a specific model architecture and hyperparameters for each individual optimizer that we may try.
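Hydra's config groups (as used in the exercise above) are the intended way to organize this, but as a minimal sketch of the idea you can also combine the two suggested files manually with plain OmegaConf (the file names model_conf.yaml and training_conf.yaml are the ones assumed above):
from omegaconf import OmegaConf\n\nmodel_cfg = OmegaConf.load('model_conf.yaml')\ntrain_cfg = OmegaConf.load('training_conf.yaml')\ncfg = OmegaConf.merge(model_cfg, train_cfg)  # later configs override earlier ones on conflicts\nprint(OmegaConf.to_yaml(cfg))  # inspect the combined configuration\n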
Image credit"},{"location":"s3_reproducibility/docker/","title":"M9 - Docker","text":""},{"location":"s3_reproducibility/docker/#docker","title":"Docker","text":"
Core Module
Image creditWhile the above picture may seem silly at first, it is actually pretty close to how docker came into existence. A big part of creating an MLOps pipeline is that you are able to reproduce it. Reproducibility goes beyond versioning our code with git
and using conda
environments to keep track of our python installations. To really get reproducibility we also need to capture system level components like
Docker provides this kind of system-level reproducibility by creating isolated program dependencies. In addition to providing reproducibility, one of docker's key features is also scalability, which is important when we later on are going to discuss deployment. Because docker is system-level reproducible, it does not (conceptually) matter if we try to start our program on a single machine or on 1000 machines at once.
"},{"location":"s3_reproducibility/docker/#docker-overview","title":"Docker overview","text":"Docker has three main concepts: docker file, docker image and docker container:
A docker file is a basic text document that contains all the commands a user could call on the command line to run an application. This includes installing dependencies, pulling data from online storage, setting up code and what commands that you want to run (e.g. python train.py
)
Running, or more correctly building a docker file will create a docker image. An image is a lightweight, standalone/containerized, executable package of software that includes everything (application code, libraries, tools, dependencies etc.) necessary to make an application run.
Actually running an image will create a docker container. This means that the same image can be launched multiple times, creating multiple containers.
The exercises today will focus on how to construct the actual docker file, as this is the first step to constructing your own container.
"},{"location":"s3_reproducibility/docker/#docker-sharing","title":"Docker sharing","text":"The whole point of using docker is that sharing applications becomes much easier. In general, we have two options
After creating the Dockerfile
we can simply commit it to github (it's just a text file) and then ask other users to build the image themselves.
After building the image ourselves, we can choose to upload it to an image registry such as Docker Hub where others can get our image by simply running docker pull
, making them able to instantaneously run it as a container, as shown in the figure below
In the following exercises we guide you how to build a docker file for your MNIST repository that will make the training and prediction a self contained application. Please make sure that you somewhat understand each step and do not just copy of the exercise. Also note that you probably need to execute the exercise from an elevated terminal e.g. with administrative privilege.
The exercises today are only an introduction to docker and some of the steps are going to be unoptimized from a production point of view. For example, we often want to keep the size of the docker image as small as possible, which we are not focusing on for these exercises.
If you are using VScode
then we recommend install the docker VScode extension for easy getting an overview of which images have been build and which are running. Additionally the extension named Dev Containers may also be beneficial for you to download.
Start by installing docker. How much trouble you need to go through depends on your operating system. For Windows and Mac we recommend they install Docker desktop, which comes with a graphical user interface (GUI) for quickly viewing docker images and docker containers currently built/in use. Windows users that have not installed WSL yet are going to have to do it now (as docker needs it as a backend for starting virtual machines) but you do not need to install docker in WSL. After installing docker we recommend that you restart your laptop.
Try running the following to confirm that your installation is working:
docker run hello-world\n
which should give the message
Hello from Docker!\nThis message shows that your installation appears to be working correctly.\n
Next let's try to download an image from docker hub. Download the busybox
image:
docker pull busybox\n
which is a very small (1-5Mb) containerized application that contains the most essential GNU fileutils, shellutils etc.
After pulling the image, write
docker images\n
which should show you all images that are available. You should see the busybox
image that we just downloaded.
Lets try to run this image
docker run busybox\n
you will get that nothing happens! The reason for that is that we did not provide any commands to docker run
. We essentially just ask it to start the busybox
virtual machine, do nothing and then close it again. Now, try again this time with
docker run busybox echo \"hello from busybox\"\n
Note how fast this process is. In just a few seconds, Docker is able to start a virtual machine, execute a command and kill it afterwards.
Try running
docker ps\n
what does this command do? What if you add -a
to the end?
If we wanted to run multiple commands within the virtual machine, we can start it in interactive mode
docker run -it busybox\n
this can be a great way to investigate what the filesystem of our virtual machine looks like.
As you may have already noticed by now, each time we execute docker run
we can still see small remnants of the containers using docker ps -a
. These stray containers can end up taking up a lot of disk space. To remove them, use docker rm
where you provide the container id that you want to delete
docker rm <container_id>\n
Let's now move on to trying to construct a docker file ourselves for our MNIST project. Create a file called trainer.dockerfile
. The intention is that we want to develop one dockerfile for running our training script and one for doing predictions.
Instead of starting from scratch we nearly always want to start from some base image. For this exercise we are going to start from a simple python
image. Add the following to your Dockerfile
# Base image\nFROM python:3.9-slim\n
Next we are going to install some essentials in our image. The essentials more or less consist of a python installation. These instructions may seem familiar if you are using linux:
# install python\nRUN apt update && \\\n apt install --no-install-recommends -y build-essential gcc && \\\n apt clean && rm -rf /var/lib/apt/lists/*\n
The previous two steps are common for any docker application where you want to run python. All the remaining steps are application specific (to some degree):
Lets copy over our application (the essential parts) from our computer to the container:
COPY requirements.txt requirements.txt\nCOPY pyproject.toml pyproject.toml\nCOPY <project-name>/ <project-name>/\nCOPY data/ data/\n
Remember that we only want the essential parts to keep our docker image as small as possible. Why do we need each of these files/folders to run training in our docker container?
Lets set the working directory in our container and add commands that install the dependencies (1):
We split the installation into two steps, such that docker can cache our project dependencies separately from our application code. This means that if we change our application code, we do not need to reinstall all the dependencies. This is a common strategy for docker images.
As an alternative you can use RUN make requirements
if you have a Makefile
that installs the dependencies. Just remember to also copy over the Makefile
into the docker image.
WORKDIR /\nRUN pip install -r requirements.txt --no-cache-dir\nRUN pip install . --no-deps --no-cache-dir\n
the --no-cache-dir
is quite important. Can you explain what it does and why it is important in relation to docker.
Finally, we are going to name our training script as the entrypoint for our docker image. The entrypoint is the application that we want to run when the image is being executed:
ENTRYPOINT [\"python\", \"-u\", \"<project_name>/train_model.py\"]\n
the \"u\"
here makes sure that any output from our script e.g. any print(...)
statements get redirected to our terminal. If not included, you would need to use docker logs
to inspect your run.
We are now ready to building our docker file into a docker image
docker build -f trainer.dockerfile . -t trainer:latest\n
MAC M1/M2 users In general docker images are build for a specific platform. For example, if you are using a Mac with a M1/M2 chip then you are running on a ARM architecture. If you are using a Windows or Linux machine then you are running on a AMD64 architecture. This is important to know when building docker images. Thus, docker images you build may not work on other platforms than the one you build it on. You can specify which platform you want to build for by adding the --platform
argument to the docker build
command:
docker build --platform linux/amd64 -f train.dockerfile . -t trainer:latest\n
and also when running the image:
docker run --platform linux/amd64 trainer:latest\n
Do note that this will significantly increase the build and run time of your docker image when running locally, because docker will need to emulate the other platform. In general for the exercises today, you should not need to specify the platform, but be aware of this if you are building docker images on your own.
please note here we are providing two extra arguments to docker build
. The -f trainer.dockerfile .
(the dot is important to remember) indicates which dockerfile we want to build from (not needed if you named it just Dockerfile
) and the -t trainer:latest
is the respective name and tag that we see afterwards when running docker images
(see image below). Please note that building a docker image can take a couple of minutes.
Docker images and space
Docker images can take up a lot of space on your computer, especially the docker images we are trying to build, because Pytorch is a huge dependency. If you are running low on space, you can try to
docker system prune\n
alternatively you can manually delete images using docker rmi {image_name}:{image_tag}
.
Try running docker images
and confirm that you get output similar to the one above. If you succeed with this, then try running the docker image
docker run --name experiment1 trainer:latest\n
you should hopefully see your training starting. Please note that we can start as many containers as we want at the same time by giving them all different names using the --name
tag.
You are most likely going to re-build your docker image multiple times, either due to an implementation error or the addition of new functionality. Therefore, instead of watching pip suffer through downloading torch
for the 20th time, you can reuse the cache from the last time the docker image was built. To do this, replace the line in your dockerfile that installs your requirements with:
RUN --mount=type=cache,target=~/pip/.cache pip install -r requirements.txt --no-cache-dir\n
which mounts your local pip cache to the docker image. For building the image you need to have enabled the BuildKit feature. If you have docker version v23.0 or later (you can check this by running docker version
) then this is enabled by default. Else you need to enable it by setting the environment variable DOCKER_BUILDKIT=1
before building the image.
Try changing your dockerfile and re-building the image. You should see that the build process is much faster.
Remember, if you ever are in doubt how files are organized inside a docker image you always have the option to start the image in interactive mode:
docker run -it --entrypoint sh {image_name}:{image_tag}\n
When your training has completed you will notice that any files that are created when running your training script are not present on your laptop (for example if your script is saving the trained model to file). This is because the files were created inside your container (which is its own little machine). To get the files you have two options:
If you already have a completed run then you can use
docker cp\n
to copy the files between your container and laptop. For example to copy a file called trained_model.pt
from a folder you would do:
docker cp {container_name}:{dir_path}/{file_name} {local_dir_path}/{local_file_name}\n
Try this out.
A much more efficient strategy is to mount a volume that is shared between the host (your laptop) and the container. This can be done with the -v
option for the docker run
command. For example, if we want to automatically get the trained_model.pt
file after running our training script we could simply execute the container as
docker run --name {container_name} -v %cd%/models:/models/ trainer:latest\n
this command mounts our local models
folder as a corresponding models
folder in the container. Any file saved by the container to this folder will be synchronized back to our host machine. Try this out! Note that if you have multiple files/folders that you want to mount, you need to add multiple -v flags (if in doubt about the file organization in the container, try to do the next exercise first). Also note that the %cd%
needs to change depending on your OS, see this page for help.
With training done we also need to write an application for prediction. Create a new docker file called predict.dockerfile
. This file should call your <project_name>/models/predict_model.py
script instead. This image will need some trained model weights to work. Feel free to either include these during the build process or mount them afterwards. When you have created the file try to build
and run
it to confirm that it works. Hint: if you are passing in the model checkpoint and prediction data as arguments to your script, your docker run
probably need to look something like
docker run --name predict --rm \\\n -v %cd%/trained_model.pt:/models/trained_model.pt \\ # mount trained model file\n -v %cd%/data/example_images.npy:/example_images.npy \\ # mount data we want to predict on\n predict:latest \\\n ../../models/trained_model.pt \\ # argument to script, path relative to script location in container\n ../../example_images.npy\n
(Optional, requires GPU support) By default a virtual machine created by docker only has access to your cpu
and not your gpu
. While you do not necessarily have a laptop with a GPU that supports training of neural networks (e.g. one from Nvidia), it is beneficial that you understand how to construct a docker image that can take advantage of a GPU if you were to run this in the future on a machine that has a GPU (e.g. in the cloud). It does take a bit more work, but many of the steps will be similar to building a normal docker image.
There are three prerequisites for working with Nvidia GPU accelerated docker containers. First you need to have the Docker Engine installed (already taken care of), have an Nvidia GPU with updated GPU drivers and finally have the Nvidia container toolkit installed. The last part you most likely have not installed and need to do. Some distros of Linux have known problems with the installation process, so you may have to search through known issues in the nvidia-docker repository to find a solution
To test that everything is working start by pulling a relevant Nvidia docker image. In my case this is the correct image:
docker pull nvidia/cuda:11.0.3-base-ubuntu20.04\n
but it may differ based on what cuda version you have. You can find all the different official Nvidia images here. After pulling the image, try running the nvidia-smi
command inside a container based on the image you just pulled. It should look something like this:
docker run --rm --gpus all nvidia/cuda:11.0.3-base-ubuntu20.04 nvidia-smi\n
and should show an image like below:
If it does not work, try redoing the steps.
We should hopefully have a working setup now for running Nvidia accelerated docker containers. The next step is to get Pytorch inside our container, such that our Pytorch implementation also correctly identifies the GPU. Luckily for us Nvidia provides a set of docker images for GPU-optimized software for AI, HPC and visualizations through their NGC Catalog. The containers that have to do with Pytorch can be seen here. Try pulling the latest:
docker pull nvcr.io/nvidia/pytorch:22.07-py3\n
It may take some time, because the NGC images include a lot of other software for optimizing Pytorch applications. It may be possible for you to find other images for running GPU accelerated applications that have a smaller memory footprint, but NGC is the recommended and supported way.
Let's test that this container works:
docker run --gpus all -it --rm nvcr.io/nvidia/pytorch:22.07-py3\n
this should run the container in interactive mode attached to your current terminal. Try opening python
in the container and try writing:
import torch\nprint(torch.cuda.is_available())\n
which hopefully should return True
.
Finally, we need to incorporate all this into our already developed docker files for our application. This is also fairly easy as we just need to change our FROM
statement in the beginning of our docker file:
FROM python:3.7-slim\n
change to
FROM nvcr.io/nvidia/pytorch:22.07-py3\n
try doing this to one of your docker files, build the image and run the container. Remember to check that your application is using GPU by printing torch.cuda.is_available()
.
(Optional) Another way you can use dockerfiles in your day to day work is for Dev-containers. Developer containers allow you to develop code directly inside a container, making sure that your code is running in the same environment as it will when deployed. This is especially useful if you are working on a project that has a lot of dependencies that are hard to install on your local machine. Setup instructions for VS code and Pycharm can be found here (should be simple since we have already installed docker):
We focus on the VS code setup here.
First install the Remote - Containers extension.
Create a .devcontainer
folder in your project root and create a Dockerfile
inside it. We keep this file very barebone for now, so lets just define a base installation of python:
FROM python:3.11-slim-buster\n\nRUN apt update && \\\n apt install --no-install-recommends -y build-essential gcc && \\\n apt clean && rm -rf /var/lib/apt/lists/*\n
Create a devcontainer.json
file in the .devcontainer
folder. This file should look something like this:
{\n \"name\": \"my_working_env\",\n \"dockerFile\": \"Dockerfile\",\n \"postCreateCommand\": \"pip install -r requirements.txt\"\n}\n
this file tells VS code that we want to use the Dockerfile
that we just created and that we want to install our python dependencies after the container has been created.
After creating these files, you should be able to open the command palette in VS code (F1) and search for the option Remote-Containers: Reopen in Container
or Remote-Containers: Rebuild and Reopen in Container
. Choose either of these options.
This will start a new VS code instance inside a docker container. You should be able to see this in the bottom left corner of your VS code window. You should also be able to see that the python interpreter has changed to the one inside the container.
You are now ready to start developing inside the container. Try opening a terminal and run python
and import torch
to confirm that everything is working.
(Optional) In M8 on Data version control you learned about the framework dvc
for version controlling data. A natural question at this point would then be how to incorporate dvc
into our docker image. We need to do two things:
dvc
has all the correct files to pull data from our remote storagedvc
has the correct credentials to pull data from our remote storageWe are going to assume that dvc
(and any dvc
extension needed) is part of your requirements.txt
file and that it is already being installed in a RUN pip install -r requirements.txt
command in your dockerfile. If not, then you need to add it.
Add the following lines to your dockerfile
RUN dvc init --no-scm\nCOPY .dvc/config .dvc/config\nCOPY *.dvc *.dvc\nRUN dvc config core.no_scm true\nRUN dvc pull\n
The first line initializes dvc
in the docker image. The --no-scm
option is needed because normally dvc
can only be initialized inside a git repository, but this option allows us to initialize dvc
without being in one. The second and third lines copy over the dvc
config file and the dvc
metadata files that are needed to pull data from your remote storage. The last line pulls the data.
If your data is not public, we need to provide credentials in some way to pull the data. We are for now going to do it in a not-so-secure way. When dvc
first connected to your drive a credential file was created. This file is located in $CACHE_HOME/pydrive2fs/{gdrive_client_id}/default.json
where $CACHE_HOME
depends on your operating system:
~/Library/Caches
~/.cache
This is the typical location, but it may vary depending on what distro you are running
{user}/AppData/Local
Find the file. The content should look similar to this (only some fields are shown):
{\n \"access_token\": ...,\n \"client_id\": ...,\n \"client_secret\": ...,\n \"refresh_token\": ...,\n ...\n}\n
We are going to copy the file into our docker image. This of course is not a secure way of doing it, but it is the easiest way to get started. As long as you are not sharing your docker image with anyone else, then it is fine. Add the following lines to your dockerfile before the RUN dvc pull
command:
COPY <path_to_default.json> default.json\nRUN dvc remote modify myremote --local gdrive_service_account_json_file_path default.json\n
where <path_to_default.json>
is the path to the default.json
file that you just found. The last line tells dvc
to use the default.json
file as the credentials for pulling data from your remote storage. You can confirm that this works by running dvc pull
in your docker image.
What is the difference between a docker image and a docker container?
SolutionA docker image is a template for a docker container. A docker container is a running instance of a docker image. A docker image is a static file, while a docker container is a running process.
What are the 3 steps involved in containerizing an application?
SolutionThe three steps are: write a dockerfile, build the dockerfile into a docker image and run the image as a docker container.What advantage is there to running your application inside a docker container instead of running the application directly on your machine?
SolutionRunning inside a docker container gives you a consistent and independent environment for your application. This means that you can be sure that your application will run the same way on your machine as it will on another machine. Thus, docker gives the ability to abstract away the differences between different machines.
A docker image is built from a series of layers that are stacked on top of each other. This should be clear if you look at the output when building a docker image. What is the advantage of this?
SolutionThe advantage is efficiency and reusability. When a change is made to a docker image, only the layer(s) that are changed need to be updated. For example, if you update the application code in your docker image, which usually is the last layer, then only that layer needs to be rebuilt, making the process much faster. Additionally, if you have multiple docker images that share the same base image, then the base image only needs to be downloaded once.
This covers the absolute minimum you should know about docker to get a working image and container. If you want to really deep dive into this topic you can find a copy of the Docker Cookbook by S\u00e9bastien Goasguen in the literature folder.
If you are actively going to be using docker in the near future, one thing to consider is the image size. Even the simple images that we have built still take up GBs in size. A number of optimization steps can be taken to reduce the image size for you or your end user. If you have time you can read this article on different approaches to reduce image size. Additionally, you can take a look at the dive-in extension for docker desktop that lets you explore your docker images in depth.
"},{"location":"s4_debugging_and_logging/","title":"Debugging, Profiling, Logging and Boilerplate","text":"Slides
Today we are initially going to go over three different topics that are all fundamentally necessary for any data scientist or DevOps engineer:
All three topics can be characterized by something you probably already are familiar with. Since you started programming, you have done debugging, as nobody can write perfect code on the first try. Similarly, while you may not have directly profiled your code, I bet that you at some point have had some very slow code and optimized it to run faster. Identifying and improving slow code is the fundamental task of profiling. Finally, logging is a very broad term and basically refers to any kind of output from your applications that helps you at a later point identify the \"performance\" of your application.
However, while we expect you to already be familiar with these topics, we do not expect all of you to be experts in them, as these are rarely topics that are focused on. Today we are going to introduce some best practices and tools to help you with each and every one of these three important topics.
As the final topic for today we are going to learn about how we can minimize boilerplate and focus on coding what actually matters for our project instead of all the boilerplate to get it working.
Learning objectives
The learning objectives of this session are:
pytorch-lightning
framework to minimize boilerplate code and structure deep learning modelsBoilerplate is a general term that describes any standardized text, copy, documents, methods, or procedures that may be used over again without making major changes to the original. But how does this relate to doing machine learning projects? If you have already tried doing a couple of projects within machine learning you will probably have seen a pattern: every project usually consist of these three aspects of code:
While the latter two certainly seem important, in most cases the actual development or research often revolves around defining the model. In this sense, both the training code and the utilities become boilerplate that should just carry over from one project to another. But the problem usually is that we have not generalized our training code to take care of the small adjustments that may be required in future projects, and we therefore end up implementing it over and over again every time we start a new project. This is of course a waste of our time that we should try to find a solution to.
This is where high-level frameworks come into play. High-level frameworks are built on top of another framework (Pytorch in this case) and try to abstract/standardize how to do particular tasks such as training. At first it may seem irritating that you need to comply with someone else's code structure, however there is a good reason for that. The idea is that you can focus on what really matters (your task, model architecture etc.) and do not have to worry about the actual boilerplate that comes with it.
The most popular high-level (training) frameworks within the Pytorch
ecosystem are:
They all offer many of the same features, so choosing one over the other for most projects should not matter that much. We are here going to use Pytorch Lightning
, as it offers all the functionality that we are going to need later in the course.
In general we refer to the documentation from Pytorch lightning if in doubt about how to format your code for doing specific tasks. We are here going to explain the key concepts of the API that you need to understand to use the framework, starting with the LightningModule
and the Trainer
.
The LightningModule
is a subclass of a standard nn.Module
that basically adds additional structure. In addition to the standard __init__
and forward
methods that need to be implemented in a nn.Module
, a LightningModule
further requires two more methods implemented:
training_step
: should contain your actual training code e.g. given a batch of data this should return the loss that you want to optimize
configure_optimizers
: should return the optimizer that you want to use
Below is shown what these two methods could look like when added to a standard MNIST classifier
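As a minimal sketch (assuming a simple, hypothetical fully connected classifier on flattened 28x28 MNIST images, not the exact model from the course material):
import torch\nfrom torch import nn\nfrom pytorch_lightning import LightningModule\n\nclass MyAwesomeModel(LightningModule):\n    def __init__(self):\n        super().__init__()\n        # hypothetical architecture, replace with your own\n        self.net = nn.Sequential(nn.Flatten(), nn.Linear(784, 128), nn.ReLU(), nn.Linear(128, 10))\n        self.loss_fn = nn.CrossEntropyLoss()\n\n    def forward(self, x):\n        return self.net(x)\n\n    def training_step(self, batch, batch_idx):\n        x, y = batch\n        loss = self.loss_fn(self(x), y)\n        return loss  # lightning takes care of the backward pass and optimizer step\n\n    def configure_optimizers(self):\n        return torch.optim.Adam(self.parameters(), lr=1e-3)\n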
Compared to a standard nn.Module
, the additional methods in the LightningModule
basically specify exactly how you want to optimize your model.
The second component to lightning is the Trainer
object. As the name suggests, the Trainer object takes care of the actual training, automating everything that you do not want to worry about.
from pytorch_lightning import Trainer\nmodel = MyAwesomeModel()  # this is our LightningModule\ntrainer = Trainer()\ntrainer.fit(model)\n
That is essentially all that you need to specify in lightning to have a working model. The trainer object does not have methods that you need to implement yourself, but it has a bunch of arguments that can be used to control how many epochs you want to train for, whether you want to run on GPU etc. To get the training of our model to work we just need to specify how our data should be fed into the lightning framework.
"},{"location":"s4_debugging_and_logging/boilerplate/#data","title":"Data","text":"For organizing our code that has to do with data in Lightning
we essentially have three different options. However, all three assume that we are using torch.utils.data.DataLoader
for the dataloading.
If we already have a train_dataloader
and possibly also a val_dataloader
and test_dataloader
defined we can simply add them to our LightningModule
using the similarly named methods:
def train_dataloader(self):\n return DataLoader(...)\n\ndef val_dataloader(self):\n return DataLoader(...)\n\ndef test_dataloader(self):\n return DataLoader(...)\n
Maybe even simpler, we can directly feed such dataloaders in the fit
method of the Trainer
object:
trainer.fit(model, train_dataloader, val_dataloader)\ntrainer.test(model, test_dataloader)\n
Finally, Lightning
also has the LightningDataModule
that organizes data loading into a single structure, see this page for more info. Putting data loading into a DataModule
makes sense as it can then be reused between projects.
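A minimal sketch of such a DataModule could look like this (the class name and dataset attributes are placeholders for your own data):
from pytorch_lightning import LightningDataModule\nfrom torch.utils.data import DataLoader\n\nclass MNISTDataModule(LightningDataModule):\n def __init__(self, batch_size: int = 32):\n super().__init__()\n self.batch_size = batch_size\n\n def setup(self, stage=None):\n # load or split your datasets here; these attributes are placeholders\n self.train_set, self.val_set, self.test_set = ..., ..., ...\n\n def train_dataloader(self):\n return DataLoader(self.train_set, batch_size=self.batch_size, shuffle=True)\n\n def val_dataloader(self):\n return DataLoader(self.val_set, batch_size=self.batch_size)\n\n def test_dataloader(self):\n return DataLoader(self.test_set, batch_size=self.batch_size)\n
The datamodule can then be passed directly to the trainer, e.g. trainer.fit(model, datamodule=MNISTDataModule()).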
Callbacks are one way to add additional functionality to your model that, strictly speaking, is not already part of your model. Callbacks should therefore be seen as self-contained features that can be reused between projects. You have the option to implement callbacks yourself (by inheriting from the pytorch_lightning.callbacks.Callback
base class) or use one of the built-in callbacks. Of particular interest are the ModelCheckpoint
and EarlyStopping
callbacks:
The ModelCheckpoint
makes sure to save checkpoints of your model. This is in principle not hard to do yourself, but the ModelCheckpoint
callback offers additional functionality such as saving checkpoints only when some metric improves, or only saving the best K
performing models etc.
from pytorch_lightning.callbacks import ModelCheckpoint\n\nmodel = MyModel()\ncheckpoint_callback = ModelCheckpoint(\n dirpath=\"./models\", monitor=\"val_loss\", mode=\"min\"\n)\ntrainer = Trainer(callbacks=[checkpoint_callback])\ntrainer.fit(model)\n
The EarlyStopping
callback can help you prevent overfitting by automatically stopping the training if a certain value is not improving anymore:
model = MyModel()\nearly_stopping_callback = EarlyStopping(\n monitor=\"val_loss\", patience=3, verbose=True, mode=\"min\"\n)\ntrainer = Trainer(callbacks=[early_stopping_callback])\ntrainer.fit(model)\n
Multiple callbacks can be used by passing them all in a list e.g.
trainer = Trainer(callbacks=[checkpoint_callback, early_stopping_callback])\n
"},{"location":"s4_debugging_and_logging/boilerplate/#exercises","title":"\u2754 Exercises","text":"Please note that the in following exercise we will basically ask you to reformat all your MNIST code to follow the lightning standard, such that we can take advantage of all the tricks the framework has to offer. The reason we did not implement our model in lightning
to begin with, is that to truly understand why it is beneficially to use a high-level framework to do some of the heavy lifting you need to have gone through some of implementation troubles yourself.
Install pytorch lightning:
pip install pytorch-lightning # (1)!\n
pip install lightning
which includes more than just the Pytorch Lightning
package. This also includes Lightning Fabric
and Lightning Apps
which you can read more about here and here. Convert your corrupted MNIST model into a LightningModule
. You can either choose to completely override your old model or implement it in a new file. The bare minimum that you need to add while converting to get it working with the rest of lightning:
The training_step
method. This function should contain essentially what goes into a single training step and should return the loss at the end
The configure_optimizers
method
Please read the documentation for more info.
Make sure your data is formatted such that it can be loaded using the torch.utils.data.DataLoader
object.
Instantiate a Trainer
object. It is recommended to take a look at the trainer arguments (there are many of them) and maybe adjust some of them:
Investigate what the default_root_dir
flag does
By default lightning will run for 1000 epochs. This may be too much (for now). Change this by changing the appropriate flag. Additionally, there also exists a flag to set the maximum number of steps that we should train for.
To start with we also want to limit the amount of training data to 20% of its original size. Which trainer flag do you need to set for this to work?
Try fitting your model: trainer.fit(model)
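If you get stuck on the previous points, one possible configuration could look like the sketch below (the flag names are taken from the Lightning Trainer API and the values are only examples, so verify them against the documentation):
from pytorch_lightning import Trainer\n\ntrainer = Trainer(\n default_root_dir='outputs', # where logs and checkpoints end up\n max_epochs=10, # instead of the default of 1000\n limit_train_batches=0.2, # only use 20% of the training data\n)\ntrainer.fit(model)\n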
Now try adding some callbacks
to your trainer.
The previous module was all about logging in wandb
, so the question is naturally how does lightning
support this. Lightning does not only support wandb
, but also many others. Common for all of them is that logging just needs to happen through the self.log
method in your LightningModule
:
Add self.log
to your LightningModule. It should look something like this:
def training_step(self, batch, batch_idx):\n data, target = batch\n preds = self(data)\n loss = self.criterion(preds, target)\n acc = (target == preds.argmax(dim=-1)).float().mean()\n self.log('train_loss', loss)\n self.log('train_acc', acc)\n return loss\n
Add the wandb
logger to your trainer
trainer = Trainer(logger=pl.loggers.WandbLogger(project=\"dtu_mlops\"))\n
and try to train the model. Confirm that you are seeing the scalars appearing in your wandb
portal.
self.log
sadly only supports logging scalar tensors. Luckily, for logging other quantities we can still access the standard wandb.log
through our model
def training_step(self, batch, batch_idx):\n ...\n # self.logger.experiment is the same as wandb.log\n self.logger.experiment.log({'logits': wandb.Histogram(preds)})\n
try doing this by logging something other than scalar tensors.
Finally, we maybe also want to do some validation or testing. In lightning we just need to add the validation_step
and test_step
to our lightning module and supply the respective data in form of a separate dataloader. Try to at least implement one of them.
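As a sketch, a validation_step could mirror the training step (this assumes the criterion attribute and logging calls from the earlier examples):
def validation_step(self, batch, batch_idx):\n data, target = batch\n preds = self(data)\n loss = self.criterion(preds, target)\n acc = (target == preds.argmax(dim=-1)).float().mean()\n self.log('val_loss', loss)\n self.log('val_acc', acc)\n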
(Optional, requires GPU) One of the big advantages of using lightning
is that you no longer need to deal with device placement, e.g. calling .to('cuda')
everywhere. If you have a GPU, try to set the gpus
flag in the trainer. If you do not have one, do not worry, we are going to return to this when we are going to run training in the cloud.
(Optional) By default Pytorch uses float32
for representing floating point numbers. However, research has shown that neural network training is very robust towards a decrease in precision. The great benefit of going from float32
to float16
is that we get approximately half the memory consumption. Try out half-precision training in Pytorch lightning. You can enable this by setting the precision flag in the Trainer
.
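A sketch of what this could look like is given below; note that the exact value accepted by the precision flag depends on your Lightning version:
trainer = Trainer(precision='16-mixed') # newer Lightning versions; older versions use precision=16\n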
(Optional) Lightning also has built-in support for profiling. Check out how to do this using the profiler argument in the Trainer
object.
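As a sketch, enabling the built-in profiler can be as simple as the line below (the string values 'simple' and 'advanced' are the common options we are aware of; check the documentation for more):
trainer = Trainer(profiler='simple') # 'advanced' gives a more detailed, function-level report\n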
(Optional) Another great feature of Lightning is that it allows for easily defining command line interfaces through the Lightning CLI feature. The Lightning CLI is essentially a drop-in replacement for defining command line interfaces (covered in this module) and can also replace the need for config files (covered in this module) for securing reproducibility when working inside the Lightning framework. We highly recommend checking out the feature and that you try to refactor your code such that you do not need to call trainer.fit
anymore but it is instead directly controlled from the Lightning CLI.
Free exercise: Experiment with what the lightning framework is capable of. Either try out more of the trainer flags, some of the other callbacks, or maybe look into some of the other methods that can be implemented in your lightning module. Only your imagination is the limit!
That covers everything for today. It has been a mix of topics that all should help you write \"better\" code (by some objective measure). If you want to deep dive more into the Pytorch lightning framework, we highly recommend looking at the different tutorials in the documentation that covers more advanced models and training cases. Additionally, we also want to highlight other frameworks in the lightning ecosystem:
Debugging is very hard to teach and is one of the skills that just comes with experience. That said, there are good and bad ways to debug a program. We are all probably familiar with just inserting print(...)
statements everywhere in our code. It is easy and can many times help narrow down where the problem happens. That said, this is not a great way of debugging when dealing with a very large codebase. You should therefore familiarize yourself with the built-in python debugger as it may come in handy during the course.
To invoke the built-in Python debugger you can either:
Set a trace directly with the python debugger by calling
import pdb\npdb.set_trace()\n
anywhere you want to stop the code. Then you can use different commands (see the python_debugger_cheatsheet.pdf
) to step through the code.
If you are using an editor, then you can insert inline breakpoints (in VS code this can be done by pressing F9
) and then execute the script in debug mode (inline breakpoints can often be seen as small red dots to the left of your code). The editor should then offer some interface to allow you to step through your code. Here is a guide to using the built-in debugger in VS Code.
Additionally, if your program is stopping on an error and you automatically want to start the debugger where it happens, then you can simply launch the program like this from the terminal
python -m pdb -c continue my_script.py\n
Exercise files
We here provide a script vae_mnist_bugs.py
which contains a number of bugs that you need to fix to get it running. Start by going over the script and try to understand what is going on. Hereafter, try to get it running by solving the bugs. The following bugs exist in the script:
Some of the bugs prevent the script from even running, while some of them influence the training dynamics. Try to find them all. We also provide a working version called vae_mnist_working.py
(but please try to find the bugs before looking at the script). Successfully debugging and running the script should produce three files:
orig_data.png
containing images from the standard MNIST training set; reconstructions.png
containing reconstructions from the model; and generated_samples.png
containing samples from the model. Again, we cannot stress enough that the exercise is actually not about finding the bugs but about using a proper debugger to find them.
"},{"location":"s4_debugging_and_logging/logging/","title":"M13 - Logging","text":""},{"location":"s4_debugging_and_logging/logging/#logging","title":"Logging","text":"Core Module
Logging in general refers to the practice of recording events and activities over time. Having proper logging in your applications can be extremely beneficial for a few reasons:
Debugging becomes easier because we can output information about the state of our program, variables, values etc. in a more structured way, helping us identify and fix bugs or unexpected behavior.
When we move into a more production environment, proper logging is essential for monitoring the health and performance of our application.
It can help in auditing, as logging info about specific activities etc. can help keep a record of who did what and when.
Having proper logging means that information is saved for later and can be analysed to gain insight into the behavior of our application, such as trends.
We are in this course going to divide the kind of logging we can do into two categories: application logging and experiment logging. In general application logging is important regardless of the kind of application you are developing, whereas experiment logging is important for machine learning-based projects where we are doing experiments.
"},{"location":"s4_debugging_and_logging/logging/#application-logging","title":"Application logging","text":"The most basic form of logging in Python applications is the good old print
statement:
for batch_idx, batch in enumerate(dataloader):\n print(f\"Processing batch {batch_idx} out of {len(dataloader)}\")\n ...\n
This will keep a \"record\" of the events happening in our script, in this case how far we have progressed. We could even change the print to include something like batch.shape
to also have information about the current data being processed.
Using print
statements is fine for small applications, but to have proper logging we need a bit more functionality than what print
can offer. Python actually comes with a great logging module, that defines functions for flexible logging. It is exactly this we are going to look at in this module.
The four main components to the Python logging module are:
Logger: The main entry point for using the logging system. You create instances of the Logger class to emit log messages.
Handler: Defines where the log messages go. Handlers send the log messages to specific destinations, such as the console or a file.
Formatter: Specifies the layout of the log messages. Formatters determine the structure of the log records, including details like timestamps and log message content.
Level: Specifies the severity of a log message.
Especially the last point is important to understand. Levels essentially allow us to get rid of statements like this:
if debug:\n print(x.shape)\n
where the logging is conditional on the variable debug
which we can set at runtime. Thus, it is something we can disable for users of our application (debug=False
) but have enabled when we develop the application (debug=True
). And it makes sense that not all things logged should be available to all stakeholders of a codebase. We as developers probably always want the highest level of logging, whereas users of our code need less info, and we may want to differentiate this based on the user.
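With the logging module the same effect is achieved by choosing a level instead of a boolean flag. A minimal sketch, assuming x exists as in the snippet above:
import logging\n\nlogger = logging.getLogger(__name__)\nlogger.debug('x has shape %s', x.shape) # only emitted if the logging level is DEBUG or lower\n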
It is also important to understand the difference between logging and error handling. Error handling in Python is done using raise
statements and try/except
like:
def f(x: int):\n if not isinstance(x, int):\n raise ValueError(\"Expected an integer\")\n return 2 * x\n\ntry:\n f(5)\nexcept ValueError:\n print(\"I failed to do a thing, but continuing.\")\n
Why would we ever need to log warning
, error
, critical
levels of information if we are just going to handle it? The reason is that raising exceptions is meant to change the program flow at runtime, e.g. for things we do not want the user to do but can deal with in some way. Logging is always for after a program has run, to inspect what went wrong. Sometimes you need one, sometimes the other, sometimes both.
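A small sketch of combining the two, reusing the f function from the example above (logger.exception is part of the standard library and records the message together with the traceback):
import logging\n\nlogger = logging.getLogger(__name__)\n\ntry:\n f('not an int') # will raise ValueError\nexcept ValueError:\n logger.exception('f failed, continuing with a fallback') # handled, but still recorded for later inspection\n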
Exercises are inspired by this made with ml module on the same topic. If you need help for the exercises you can find a simple solution script here.
As logging is a built-in module in Python, nothing needs to be installed. Instead start a new file called my_logger.py
and start out with the following code:
import logging\nimport sys\n\n# Create super basic logger\nlogging.basicConfig(stream=sys.stdout, level=logging.DEBUG)\nlogger = logging.getLogger(__name__) # (1)\n\n# Logging levels (from lowest to highest priority)\nlogger.debug(\"Used for debugging your code.\")\nlogger.info(\"Informative messages from your code.\")\nlogger.warning(\"Everything works but there is something to be aware of.\")\nlogger.error(\"There's been a mistake with the process.\")\nlogger.critical(\"There is something terribly wrong and process may terminate.\")\n
__name__
always contains the name of the script or module that is currently being run. Therefore, if we initialize our logger using this variable, it will always be unique to our application and not conflict with loggers setup by any third-party package. Try running the code. Then try changing the argument level
when creating the logger. What happens when you do that?
Instead of sending logs to the terminal, we may also want to send them to a file. This can be beneficial, such that only warning
level logs and higher are available to the user, but debug
and info
are still saved while the application is running.
Try adding the following dict to your logger.py
file:
logging_config = {\n \"version\": 1,\n \"formatters\": { # (1)\n \"minimal\": {\"format\": \"%(message)s\"},\n \"detailed\": {\n \"format\": \"%(levelname)s %(asctime)s [%(name)s:%(filename)s:%(funcName)s:%(lineno)d]\\n%(message)s\\n\"\n },\n },\n \"handlers\": { # (2)\n \"console\": {\n \"class\": \"logging.StreamHandler\",\n \"stream\": sys.stdout,\n \"formatter\": \"minimal\",\n \"level\": logging.DEBUG,\n },\n \"info\": {\n \"class\": \"logging.handlers.RotatingFileHandler\",\n \"filename\": Path(LOGS_DIR, \"info.log\"),\n \"maxBytes\": 10485760, # 10 MB\n \"backupCount\": 10,\n \"formatter\": \"detailed\",\n \"level\": logging.INFO,\n },\n \"error\": {\n \"class\": \"logging.handlers.RotatingFileHandler\",\n \"filename\": Path(LOGS_DIR, \"error.log\"),\n \"maxBytes\": 10485760, # 10 MB\n \"backupCount\": 10,\n \"formatter\": \"detailed\",\n \"level\": logging.ERROR,\n },\n },\n \"root\": {\n \"handlers\": [\"console\", \"info\", \"error\"],\n \"level\": logging.INFO,\n \"propagate\": True,\n },\n}\n
The formatter section determines how logs should be formatted. Here we define two separate formatters, called minimal
and detailed
which we can use in the next part of the code.
The handlers section is in charge of what should happen to the different levels of logging. console
uses the minimal
format we defined and sends logs to the stdout
stream for messages of level DEBUG
and higher. The info
handler uses the detailed
format and sends messages of level INFO
and higher to a separate info.log
file. The error
handler does the same for messages of level ERROR
and higher to a file called error.log
.
You will need to set the LOGS_DIR
variable and also figure out how to add this logging_config
using the logging config submodule to your logger.
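One way to wire it up could look like the sketch below (the LOGS_DIR value is only an example placeholder; note that it must be defined before the logging_config dict above is constructed):
import logging\nimport logging.config\nfrom pathlib import Path\n\nLOGS_DIR = Path('logs') # example value; define this before building logging_config\nLOGS_DIR.mkdir(parents=True, exist_ok=True)\n\nlogging.config.dictConfig(logging_config) # apply the dict defined above\nlogger = logging.getLogger(__name__)\nlogger.info('Logging is now configured.')\n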
When the code successfully runs, check the LOGS_DIR
folder and make sure that the info.log
and error.log
files were created with the appropriate content.
Finally, let's try to add a little bit of style and color to our logging. For this we can use rich, which is a great package for rich text and beautiful formatting in terminals. Install rich
and add the following line to your my_logger.py
script:
from rich.logging import RichHandler\n\nlogger.root.handlers[0] = RichHandler(markup=True) # set rich handler\n
and try re-running the script. Hopefully you should see something beautiful in your terminal like this:
(Optional) We already briefly touched on logging during the module on config files using hydra. If you want to configure hydra to use custom logging scheme as the one we setup in the last two exercises, you can take a look at this page. In hydra you will need to provide the configuration of the logger as config file. You can find examples of such config file here.
When most people think about machine learning, they think about the training phase. Being able to track and log experiments is an important part of understanding what is going on with your model while you are training. It can help you debug your model and help tweak your models to perfection. Without proper logging of experiments, it can be really hard to iterate on the model because you do not know which changes led to an increase or decrease in performance.
The most basic logging we can do when running experiments is writing the metrics that our model is producing e.g. the loss or the accuracy to the terminal or a file for later inspection. We can then also use tools such as matplotlib for plotting the progression of our metrics over time. This kind of workflow may be enough when doing smaller experiments or working alone on a project, but there is no way around using a proper experiment tracker and visualizer when doing large scale experiments in collaboration with others. It especially becomes important when you want to compare performance between different runs.
There exist many tools for logging your experiments, with some of them being:
All of the frameworks offer many of the same functionalities; you can see a (biased) review here. We are going to use Weights and Biases (wandb), as it supports everything we need in this course. Additionally, it is an excellent tool for collaboration and sharing of results.
Using the Weights and Bias (wandb) dashboard we can quickly get an overview and compare many runs over different metrics. This allows for better iteration of models and training procedure."},{"location":"s4_debugging_and_logging/logging/#exercises_1","title":"\u2754 Exercises","text":"Start by creating an account at wandb. I recommend using your github account but feel free to choose what you want. When you are logged in you should get an API key of length 40. Copy this for later use (HINT: if you forgot to copy the API key, you can find it under settings).
Next install wandb on your laptop
pip install wandb\n
Now connect to your wandb account
wandb login\n
you will be asked to provide the 40-character API key. The connection should remain open to the wandb server even when you close the terminal, such that you do not have to log in each time. If using wandb
in a notebook you need to manually close the connection using wandb.finish()
.
With it all setup we are now ready to incorporate wandb
into our code. The interface is fairly simple, and this guide should give enough hints to get you through the exercise. (HINT: the two methods you need to call are wandb.init
and wandb.log
). To start with, logging the training loss of your model will be enough.
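A minimal sketch of where the two calls typically go (the project name and the training loop internals are placeholders for your own code):
import wandb\n\nwandb.init(project='dtu_mlops') # placeholder project name; use your own\nfor batch_idx, batch in enumerate(train_dataloader):\n loss = training_step(batch) # placeholder for your actual training code\n wandb.log({'train_loss': loss})\n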
After running your model, checkout the webpage. Hopefully you should be able to see at least one run with something logged.
Now log something other than scalar values. This could be an image, a histogram or a matplotlib figure. In all cases the logging is still going to use wandb.log
but you need extra calls to wandb.Image
etc. depending on what you choose to log.
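For example, logging a few input images could, as one possibility, look like this (wandb.Image accepts tensors, numpy arrays and PIL images):
import wandb\n\nimages, _ = next(iter(train_dataloader))\nwandb.log({'examples': [wandb.Image(img) for img in images[:8]]})\n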
Finally, let's create a report that you can share. Click the Create report button and include some of the graphs/plots/images that you have generated in the report.
To make sure that you have completed today's exercises, make the report shareable by clicking the Share button and creating a view-only link. Send the link to my email nsde@dtu.dk
, so I can checkout your awesome work \ud83d\ude03
When calling wandb.init
you have two arguments called project
and entity
. Make sure that you understand these and try them out. It will come in handy for your group work as they essentially allow multiple users to upload their own runs to the same project in wandb
.
Wandb also comes with a built-in feature for doing hyperparameter sweeping, which can be beneficial for getting a better working model. Look through the documentation on how to do a hyperparameter sweep in Wandb. You at least need to create a new file called sweep.yaml
and make sure that you call wandb.log
in your code on an appropriate value. Note: if you want hydra
and wandb
to work together you will need to change the command
config in your sweep.yaml
file, see this page.
In the future it will be important for us to be able to run Wandb inside a docker container (together with whatever training or inference we specify). The problem here is that we cannot authenticate Wandb in the same way as in the previous exercise; it needs to happen automatically. Let's therefore look into how we can do that.
First we need to generate an authentication key, or more precisely an API key. This is in general the way any service (like a docker container) can authenticate. Start by going to https://wandb.ai/home, click your profile icon in the upper right corner and then go to settings. Scroll down to the danger zone and generate a new API key and finally copy it.
Next create a new docker file called wandb.docker
and add the following code
FROM python:3.9\nRUN apt update && \\\n apt install --no-install-recommends -y build-essential gcc && \\\n apt clean && rm -rf /var/lib/apt/lists/*\nRUN pip install wandb\nCOPY s4_debugging_and_logging/exercise_files/wandb_tester.py wandb_tester.py\nENTRYPOINT [\"python\", \"-u\", \"wandb_tester.py\"]\n
please take a look at the script being copied into the image and afterwards build the docker image.
When we want to run the image, what we need to do is include an environment variable that contains the API key we generated. This will then authenticate the docker container with the wandb server:
docker run -e WANDB_API_KEY=<your-api-key> wandb:latest\n
Try running it and confirm that the results are uploaded to the wandb server.
Feel free to experiment more with wandb
as it is a great tool for logging, organizing and sharing experiments.
That is the module on logging. Please note that at this point in the course you will begin to see some overlap between the different frameworks. While we mainly used hydra
for configuring our python scripts it can also be used to save metrics and hyperparameters similar to how wandb
can. Similar arguments hold for dvc
which can also be used to log metrics. In our opinion wandb
just offers a better experience when interacting with the results after logging. We want to stress that the combination of tools presented in this course may not be the best for all your future projects, and we recommend finding a setup that fits you. That said, each framework provides specific features that the others do not.
Finally, we want to note that during the course we really try to showcase a lot of open-source frameworks; Wandb is not one of them. It is free to use for personal usage (with a few restrictions), but for enterprise it does require a license. If you are eager to only work with open-source tools we highly recommend trying out MLFlow, which offers the same overall functionalities as Wandb.
"},{"location":"s4_debugging_and_logging/profiling/","title":"M12 - Profiling","text":""},{"location":"s4_debugging_and_logging/profiling/#profilers","title":"Profilers","text":"Core Module
"},{"location":"s4_debugging_and_logging/profiling/#profilers_1","title":"Profilers","text":"In general profiling code is about improving the performance of your code. In this session we are going to take a somewhat narrow approach to what \"performance\" is: runtime, meaning the time it takes to execute your program.
At the bare minimum, the two questions a proper profiling of your program should be able to answer are:
The first question is important for prioritizing optimization. If two methods A
and B
have approximately the same runtime, but A
is called 1000 times more often than B
we should probably spend time optimizing A
over B
if we want to speed up our code. The second question more or less answers itself, directly telling us which methods are the most expensive to call.
Using profilers can help you find bottlenecks in your code. In this exercise we will look at two different profilers, with the first one being the cProfile. cProfile
is Python's built-in profiler that can help give you an overview of the runtime of all the functions and methods involved in your program.
Run the cProfile
on the vae_mnist_working.py
script. Hint: you can directly call the profiler on a script using the -m
arg
python -m cProfile -o <output_file> -s <sort_order> myscript.py\n
Try looking at the output of the profiling. Can you figure out which function took the longest to run?
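If you prefer inspecting the results programmatically rather than reading the raw printout, the standard library pstats module can load the dumped file (the file name below is just an example for whatever you passed to the -o flag):
import pstats\n\nstats = pstats.Stats('profile_output.prof') # assumed output file name from the -o flag\nstats.sort_stats('cumulative').print_stats(10) # the 10 entries with the highest cumulative time\n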
Can you explain the difference between tottime
and cumtime
? Under what circumstances do these differ and when are they equal?
To get a better feeling of the profiled result we can try to visualize it. Python does not provide a native solution, but open-source solutions such as snakeviz exist. Try installing snakeviz
and load a profiled run into it (HINT: snakeviz expects the run to have the file format .prof
).
Try optimizing the run! (Hint: The data is not stored as a torch tensor). After optimizing the code make sure (using cProfile
and snakeviz
) that the code actually runs faster.
Profiling machine learning code can become much more complex because we are suddenly beginning to mix different devices (CPU+GPU) that can (and should) overlap some of their computations. When profiling this kind of machine learning code we are often looking for bottlenecks. A bottleneck is simply the place in your code that is preventing other processes from performing their best. This is the reason that all major deep learning frameworks also include their own profilers that can help profile more complex applications.
The image below shows a typical report using the built-in profiler in PyTorch. As the image shows, the profiler looks both at the kernel
time (this is the time spent doing actual computations) and at transfer times such as memcpy
(where we are copying data between devices). It can even analyze your code and give recommendations.
Using the profiler can be as simple as wrapping the code that you want to profile with the torch.profiler.profile
context manager
with torch.profiler.profile(...) as prof:\n # code that I want to profile\n output = model(data)\n
"},{"location":"s4_debugging_and_logging/profiling/#exercises_1","title":"\u2754 Exercises","text":"Exercise files
In these exercises we investigate the profiler that is built into PyTorch already. Note that these exercises require that you have PyTorch v1.8.1 installed (or higher). You can always check which version you currently have installed by writing (in a python interpreter):
import torch\nprint(torch.__version__)\n
But we always recommend to update to the latest Pytorch version for the best experience. Additionally, to display the result nicely (like snakeviz
for cProfile
) we are also going to use the tensorboard profiler extension
pip install torch_tb_profiler\n
A good starting point is too look at the API for the profiler. Here the important class to look at is the torch.profiler.profile
class.
Let's try out a simple example (taken from here):
Try to run the following code
import torch\nimport torchvision.models as models\nfrom torch.profiler import profile, ProfilerActivity\n\nmodel = models.resnet18()\ninputs = torch.randn(5, 3, 224, 224)\n\nwith profile(activities=[ProfilerActivity.CPU], record_shapes=True) as prof:\n model(inputs)\n
this will profile the forward
pass of Resnet 18 model.
Running this code will produce a prof
object that contains all the relevant information about the profiling. Try writing the following code:
print(prof.key_averages().table(sort_by=\"cpu_time_total\", row_limit=10))\n
what operation is taking most of the cpu?
Try running
print(prof.key_averages(group_by_input_shape=True).table(sort_by=\"cpu_time_total\", row_limit=30))\n
can you see any correlation between the shape of the input and the cost of the operation?
(Optional) If you have a GPU you can also profile the operations on that device:
with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA], record_shapes=True) as prof:\n model(inputs)\n
(Optional) As an alternative to using profile
as a context manager, we can also use its .start
and .stop
methods:
prof = profile(...)\nprof.start()\n... # code I want to profile\nprof.stop()\n
Try doing this on the above example.
The torch.profiler.profile
class takes some additional arguments. What argument would you need to set to also profile the memory usage? (Hint: this page) Try doing it on the simple example above and make sure to sort the samples by self_cpu_memory_usage
.
As mentioned we can also get a graphical output for better inspection. After having done a profiling try to export the results with:
prof.export_chrome_trace(\"trace.json\")\n
you should be able to visualize the file by going to chrome://tracing
in any chromium based web browser. Can you still identify the information printed in the previous exercises from the visualizations?
Running profiling on a single forward step can produce misleading results, as it only provides a single sample that may depend on what background processes are running on your computer. Therefore it is recommended to profile multiple iterations of your model. If this is the case then we need to include prof.step()
to tell the profiler when we are doing a new iteration
with profile(...) as prof:\n for i in range(10):\n model(inputs)\n prof.step()\n
Try doing this. Is the conclusion the same regarding which operations take up most of the time? Have the percentages changed significantly?
Additionally, we can also visualize the profiling results using the profiling viewer in tensorboard.
Start by initializing the profile
class with an additional argument:
from torch.profiler import profile, tensorboard_trace_handler\nwith profile(..., on_trace_ready=tensorboard_trace_handler(\"./log/resnet18\")) as prof:\n ...\n
Try running a profiling (using a couple of iterations) and make sure that a file with the .pt.trace.json
is produced in the log/resnet18
folder.
Now try launching tensorboard
tensorboard --logdir=./log\n
and open the page http://localhost:6006/#pytorch_profiler, where you should hopefully see an image similar to the one below:
Image credit
Try poking around in the interface.
Tensorboard has a nice feature for comparing runs under the diff
tab. Try redoing a profiling run but use model = models.resnet34()
instead. Load up both runs and try to look at the diff
between them.
As a final exercise, try to use the profiler on the vae_mnist_working.py
file from the previous module on debugging, where you profile a whole training run (not only the forward pass). What is the bottleneck during the training? Is it still the forward pass or is it something else? Can you improve the code somehow based on the information from the profiler?
This ends the module on profiling. If you want to go into more detail on this topic we can recommend looking into line_profiler and kernprof. A downside of using Python's cProfile
is that it can only profile at a function/module level, which is great for identifying hotspots in your code. However, sometimes the cause of a computational hotspot is a single line of code in a function, which will not be caught by cProfile
. An example would be a simple indexing operation such as a[idx] = b
, which for large arrays and non-sequential indices is really expensive. For these cases line_profiler and kernprof are excellent tools to have in your toolbox. Additionally, if you do not like cProfile we can also recommend py-spy, which is another open-source profiling tool for python programs.
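As a rough sketch of how line_profiler is typically used (the function and script names here are made up; the @profile decorator is injected by kernprof at runtime, so no import is needed when running through it):
# script.py -- run with: kernprof -l -v script.py\n@profile # injected by kernprof when run with the -l flag\ndef update(a, b, idx):\n a[idx] = b # the line-by-line report shows how expensive this single line is\n return a\n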
Slides
Continuous integration is a sub-discipline of the general field of continuous X. Continuous X is one of the core elements of modern DevOps, and by extension MLOps. Continuous X assumes that we have a (long) developer pipeline (see image below) where we want to make some changes to our code e.g:
Basically, we expect any code change to have an influence on the final result. The problem with making changes at the start of our pipeline is that we want the change to propagate all the way through to the end of the pipeline.
Image credit. This is where continuous X comes into play. The word continuous here refers to the fact that the pipeline should continuously be updated as we make code changes. You can also choose to think of this as the automation of processes. The X then covers the fact that the process we need to go through to automate steps in the pipeline depends on where we are in the pipeline, e.g. the tools needed to do continuous integration are different from the tools needed to do continuous delivery.
In this session, we are going to focus on continuous integration (CI). As indicated in the image above, CI usually takes care of the first part of the developer pipeline, which has to do with the code base, code building and code testing. This is a paramount step in automation, as we would rather catch bugs at the beginning of our pipeline than at the end.
Learning objectives
The learning objectives of this session are:
The Github Actions we learned about in M16 are a powerful tool that can be used for much more than simply running the tests that we write for our application. In this module we are going to look at how we can use it for continuously building docker images. As you have already seen, a docker image can take a couple of minutes to build each time we make changes to our code base. For this reason we ideally just want a new image to be built automatically every time we commit our code. Thus, it should come as no surprise that we can also automate the building process, and furthermore we can take advantage of online compute power to parallelize the process.
As discussed in the initial module on docker, docker hub is an online solution for storing built docker images in the cloud that are then easy to pull down on whatever machine you want to run on. Docker hub is free to use for personal use, as long as the images you push are public. We are in this session going to look at how we can automatically build and push our docker builds to docker hub. In a future module we are also going to look at the exact same process of building and pushing containers, but this time to a general cloud provider.
"},{"location":"s5_continuous_integration/auto_docker/#exercises","title":"\u2754 Exercises","text":"For these exercises you can choose to work with any docker file of your choosing. If you want an easy docker file, you can use the following:
FROM busybox\nCMD echo \"Howdy cowboy\"\n
Alternatively, you can choose to focus on automating the training and prediction docker files back from M9. You will most likely need to change the docker image for your applications if they contain any references to your data, e.g. if you have a COPY data/ data/
statement in the file. Since we do not store our data in Github, we cannot copy it during the build process.
Start by pushing whatever docker file you want to be continuously built to your repository
Start by creating a Docker Hub account
Next, within Docker Hub create an access token by going to Settings -> Security
. Click the New Access Token
button and give it a name that you recognize.
Copy the newly created access token and head over to your Github repository online. Go to Settings -> Secrets -> Actions
and click the New repository secret
. Copy over the access token and give it the name DOCKER_HUB_TOKEN
. Additionally, add two other secrets DOCKER_HUB_USERNAME
and DOCKER_HUB_REPOSITORY
that contain your docker username and docker repository name, respectively.
Next we are going to construct the actual Github actions workflow file:
name: Docker Image CI\n\non:\n push:\n branches: [ master ]\n\njobs:\n build:\n runs-on: ubuntu-latest\n steps:\n - uses: actions/checkout@v2\n - name: Build the Docker image\n run: |\n echo \"${{ secrets.DOCKER_HUB_TOKEN }}\" | docker login \\\n -u \"${{ secrets.DOCKER_HUB_USERNAME }}\" --password-stdin docker.io\n docker build . --file Dockerfile \\\n --tag docker.io/${{ secrets.DOCKER_HUB_USERNAME }}/${{ secrets.DOCKER_HUB_REPOSITORY }}:$GITHUB_SHA\n docker push docker.io/${{ secrets.DOCKER_HUB_USERNAME }}/${{ secrets.DOCKER_HUB_REPOSITORY }}:$GITHUB_SHA\n
The first part of the workflow file should look somewhat recognizable. However, the last three lines are where all the magic happens. Carefully go through them and figure out what they do. If you want some help you can look at the help pages for docker login
, docker build
and docker push
.
Upload the workflow to your github repository and check that it is being executed. If everything works, you should be able to see the built docker image in your container repository on docker hub.
Make sure that you can execute docker pull
locally to pull down the image that you just continuously built
(Optional) To test that the container works directly in github you can also try to include an additional step that actually runs the container.
- name: Run container\n run: |\n docker run ...\n
That ends the session on continuous docker building. We are going to revisit this topic after introducing the basic concepts of working in the cloud, as it will make our life easier in the long run when we get to continuous deployment (CD) that our containers are stored in the same place where we are going to run them. For completeness it is worth mentioning that docker hub also offers the possibility of building your images in a continuous way, by specifying so-called build rules.
"},{"location":"s5_continuous_integration/cml/","title":"M19 - Continuous Machine Learning","text":""},{"location":"s5_continuous_integration/cml/#continuous-machine-learning","title":"Continuous Machine Learning","text":"The continuous integration we have looked at until now is what we can consider \"classical\" continuous integration, that have its roots in DevOps and not MLOps. While the test that we have written and the containers ww have developed in the previous session have be around machine learning, everything we have done translate to completely to how it would be done if we had developed any other application did not include machine learning.
In this session, we are going to change gears and look at continuous machine learning (CML). As the name may suggest, we are now focusing on automating actual machine learning processes. You may ask why we need continuous integration principles baked into machine learning pipelines? The reason is the same as with any continuous integration, namely that we have a bunch of checks that we want our newly trained model to pass before we trust it. Writing unittests
ensures that our code is not broken, but there are other failure modes of a machine learning pipeline that should be checked before the model is ready for deployment:
Answering these questions in a continuous way is possible through continuous machine learning. For this session, we are going to use cml
by iterative.ai. Strictly speaking, using the cml
framework is not a necessary component for doing continuous machine learning, but it is a streamlined way of doing this and offers tools to easily get a report about how a specific run performed. If we were just interested in triggering model training every time we do a git push
we essentially just need to include
run: python train.py\n
to any of our workflow files.
The figure below describes the overall process using the cml
framework. It should be clear that it is the very same process that we go through as in the other continuous integration sessions: push code
-> trigger github actions
-> do stuff
. The new part in this session is that we want a report of the findings of the automated run to appear after the run is done.
We are first going to revisit our train.py
script. If we want cml
to automatically be able to report the performance of our trained model to us after it is trained, we need to give it some statistics to work with. Below is some pseudo-code that computes the accuracy and the confusion matrix of our trained model. Create a copy of your training script (call it train_cml.py
) and make sure your script is also producing a classification report and confusion matrix as in the pseudo-code.
# assume we have a trained model\nimport matplotlib.pyplot as plt\nimport torch\nfrom sklearn.metrics import classification_report, confusion_matrix, ConfusionMatrixDisplay\npreds, target = [], []\nfor batch in train_dataloader:\n x, y = batch\n probs = model(x)\n preds.append(probs.argmax(dim=-1).detach())\n target.append(y.detach())\n\ntarget = torch.cat(target, dim=0)\npreds = torch.cat(preds, dim=0)\n\nreport = classification_report(target, preds)\nwith open(\"classification_report.txt\", 'w') as outfile:\n outfile.write(report)\nconfmat = confusion_matrix(target, preds)\ndisp = ConfusionMatrixDisplay(confusion_matrix=confmat)\ndisp.plot()\nplt.savefig('confusion_matrix.png')\n
Similar to what we have looked at until now, automation happens using github workflow files. The main difference from the continuous integration we have looked at until now is that we are actually going to train our model whenever we do a git push
. Copy the following code into a new workflow (called cml.yaml
) and add that file to the folder where you keep your workflow files.
name: train-my-model\non: [push]\njobs:\n run:\n runs-on: [ubuntu-latest]\n steps:\n - uses: actions/checkout@v2\n - uses: iterative/setup-cml@v1\n - name: Train model\n run: |\n pip install -r requirements.txt # install dependencies\n python train.py # run training\n - name: Write report\n env:\n # this authenticates that the right permissions are in place\n REPO_TOKEN: ${{ secrets.GITHUB_TOKEN }}\n run: |\n # send all information to report.md that will be reported to us when the workflow finish\n cat classification_report.txt >> report.md\n cml-publish confusion_matrix.png --md >> report.md\n cml-send-comment report.md\n
Nearly everything in the workflow file should look familiar, except the last two lines.
Try pushing the workflow file to your github repository and make sure that it completes. If it does not, you may need to adjust the workflow file slightly.
Send yourself a pull-request. I recommend seeing this very short video on how to send yourself a pull-request with a small change. If your workflow file is executed correctly you should see github-actions
commenting with a performance report on your PR.
(Optional) cml
is offered by the same people behind dvc
and it should therefore come as no surprise that these features can interact with each other. If you want to deep dive into this, here is a great starting point.
That ends the session on continuous machine learning. If you have not already noticed, one limitation of using github actions is that their default runners e.g. runs-on: [ubuntu-latest]
are only CPU machines (see hardware config). As we all know, modern machine learning more or less requires hardware acceleration (=GPUs) to train within reasonable time. Luckily for us cml
also integrates with large cloud providers and I therefore recommend that after going through the modules on cloud computing you return to this exercise and experiment with setting up self-hosted runners.
Core Module
With the tests established in the previous module we are now ready to move on to actually implementing some continuous integration in our pipeline. As you probably have already realized, testing your code locally may be cumbersome to do, because
For these reasons we want to automate the testing, such that it is done every time we push to our repository. If we combine this with only pushing to branches and then only merging these branches whenever all automated testing has passed, our code should be fairly safe against unwanted bugs (assuming your tests cover your code well).
"},{"location":"s5_continuous_integration/github_actions/#github-actions_1","title":"Github actions","text":"Github actions are the CI solution that Github provides. Each of your repositories gets 2,000 minutes of free testing per month which should be more than enough for the scope of this course (and probably all personal projects you do). Getting Github actions setup in a repository may seem complicated at first, but workflow files that you create for one repository can more or less be reused for any other repository that you have.
Let's take a look at how a github workflow file is organized:
name
runs-on
we can specify which operating system we want the workflow to run on. We also have the possibility to specify multiple. steps
. This is where we specify the actual commands that should be run when the workflow is executed.Start by creating a .github
folder in the root of your repository. Add a sub-folder to that called workflows
.
Go over this page that explains how to do automated testing of python code in github actions. You do not have to understand everything, but at least get a feeling of what a workflow file should look like.
We have provided a workflow file called tests.yml
that should run your tests for you. Place this file in the .github/workflows/
folder. The workflow file consist of three steps
First a python environment is setup (in this case python 3.8)
Next all dependencies required to run the test are installed
Finally, pytest
is called and test will be run
For the script to work you need to define the requirements.txt
and requirements_tests.txt
. The first file should contain all packages required to run your code. The second file is all additional packages required to run the tests. In your simple case it may very well be that the second file is empty, however sometimes additional packages are used for testing that are not strictly required for the scripts to run.
Finally, try pushing the changes to your repository. Hopefully your tests should just start, and you will after sometime see a green check mark next to hash of the commit. Also try to checkout the Actions tap where you can see the history of actions run.
Normally we develop code one operating system and just hope that it will work on other operating systems. However, CI enables us to automatically test on other systems than ourself.
The provided tests.yml
only runs on one operating system. Which one?
Alter the file (or write a new) that executes the test on the two other main operating systems that exist.
As the workflow is currently setup, github actions will destroy every downloaded package when the workflow has been executed. To improve this we can take advantage of caching
:
Figure out how to implement caching
in your workflow file. You can find a guide here and here.
When you have implemented a caching system go to Actions->Caches
in your repository and make sure that they are correctly added. It should look something like the image below
Measure how long your workflow takes before and after adding caching
to your workflow. Did it improve the runtime of your workflow?
(Optional) Code coverage can also be added to the workflow file by uploading it as an artifact after running the coverage. Follow the instructions in this post on how to do it.
As stated in the introduction, ideally we want to only push our code to branches, such that our workflows run before we actually merge code into our codebase. We can directly prevent bad behavior by adding branch protection rules to our repository. Take the image below as an example from one of my own PRs:
In this example, the PR cannot be merge to the main branch before the following is fulfilled: At least 2 reviewers with write access have approved the PR, all Github actions marked as Required are passing and all conversations needs to be resolved. Since not all important tests are passing, further changes are necessary. We want to implement something similar. Do the following:
On your Github repository of choice, go to Settings -> Branches -> Add branch protection rule
:
To your main/master branch add the following rules:
To test that everything works, try creating a PR (possibly with a small bug) and see that your main/master branch is protected
One problem you may have encountered is running your tests that have to do with your data, with the core problem being that your data is actually not stored in github (assuming you have done module M8 - DVC) and therefore cannot be tested. However, it is possible for us to download data while running our CI. Lets try to setup that:
The first problem is that we need our CI needs to be able to authenticate with the our storage solution. We can take advantage of an authentication file that is created the first time we push with DVC. It is located in $CACHE_HOME/pydrive2fs/{gdrive_client_id}/default.json
where $CACHE_HOME
depends on your operating system:
~/Library/Caches
~/.cache
This is the typical location, but it may vary depending on what distro you are running
{user}/AppData/Local
Find the file. The content should look similar to this (only some fields are shown):
{\n \"access_token\": ...,\n \"client_id\": ...,\n \"client_secret\": ...,\n \"refresh_token\": ...,\n ...\n}\n
The content of that file is should be treated as an password an not shared with the world and the relevant question is therefore how to use this info in public repository. The answer is github secrets, where we can store information, access it in our workflow files and it is still not public. Navigate to the secrets option (as shown below) and create a secret with the name GDRIVE_CREDENTIALS_DATA
that contains the content of the file you found in the previous exercise.
Afterwards, add the following code to your workflow file:
- uses: iterative/setup-dvc@v1\n- name: Get data\n run: dvc pull\n env:\n GDRIVE_CREDENTIALS_DATA: ${{ secrets.GDRIVE_CREDENTIALS_DATA }}\n
that runs dvc pull
using the secret authentication file. For help you can visit this small repository that implements the same workflow.
Finally, add the changes, commit, push and confirm that everything works as expected. You should now be able to run unit tests that depends on your input data.
In module M6 on good coding practices (optional module) of the course you where introduced to a couple of good coding practices such as being consistent with your coding style, how your Python packages are sorted and that your code follows certain standards. All this was done using the ruff
framework. In this set of exercises we will setup github workflows that will automatically test for this.
Create a new workflow file called codecheck.yml
, that implements the following three steps
Setup python environment
Installs ruff
Runs ruff check
and ruff format
on the repository
(HINT: You should be able to just change the last steps of the tests.yml
workflow file)
In addition to ruff
we also used mypy
in those set of exercies for checking if the typing we added to our code was good enough. Add another step to the codecheck.yml
file which runs mypy
on your repository.
Try to make sure that all steps are passing on repository. Especially mypy
can be hard to get passing, so this exercise formally only requires you to get ruff
passing.
When working with Github actions you will often encounter the following 4 concepts:
Try to define them with your own words.
Solutionyaml
file that defines the instructions to execute on specific events. Needs to be placed in the .github/workflows
folder.The on
attribute specify upon which events the workflow will be triggered. Assume you have set the on
attribute to the following:
on:\n push:\n branches: [main]\n pull_request:\n branches: [main]\n schedule:\n - cron: \"0 0 * * *\"\n workflow_dispatch: {}\n
What 4 events would trigger the execution of that action?
Solutionmain
would trigger itmain
would trigger itThe trigger can be executed by manually triggering it through the Github UI, example shown below
This ends the module on Github workflows. If you are more interested in this topic you can checkout module M31 on documentation which first including locally building some documentation for your project and afterwards use Github actions for deploying it to Github Pages. Additionally, Github also have a lot of templates already for running a lot CI tasks. If you try to create a workflow file directly in Github you may encounter the following page
We highly recommend checking this out if you want to write any other kind of CI pipeline in Github actions. We can also recommend this repository that have an list of awesome actions and checkout the act repository which is a tool for running your GitHub Actions locally!
"},{"location":"s5_continuous_integration/pre_commit/","title":"M17 - Pre commit","text":""},{"location":"s5_continuous_integration/pre_commit/#pre-commit","title":"Pre-commit","text":"One of the cornerstones of working with git is remembering to commit your work often. Often committing makes sure that it is easier to identify and revert unwanted changes that you have introduced, because the code changes becomes smaller per commit.
However, as you hopefully already seen in the course there are a lot of mental task to do before you actually write git commit
in the terminal. The most basic thing is of course making sure that you have saved all your changes, and you are not committing a not up-to-date file. However, this also includes tasks such as styling, formatting, making sure all tests succeeds etc. All these mental to-do notes does not mix well with the principal of remembering to commit often, because you in principal have to do them every time.
The obvious solution to this problem is to automate all or some of our mental task every time that we do a commit. This is where pre-commit hooks comes into play, as they can help us attach additional tasks that should be run every time that we do a git commit
.
Pre-commit simply works by inserting whatever workflow we want to automate in between whenever we do a git commit
and afterwards would do a git push
.
The system works by looking for a file called .pre-commit-config.yaml
that we can configure. If we execute
pre-commit sample-config | out-file .pre-commit-config.yaml -encoding utf8\n
you should get a sample file that looks like
# See https://pre-commit.com for more information\n# See https://pre-commit.com/hooks.html for more hooks\nrepos:\n- repo: https://github.com/pre-commit/pre-commit-hooks\n rev: v3.2.0\n hooks:\n - id: trailing-whitespace\n - id: end-of-file-fixer\n - id: check-yaml\n - id: check-added-large-files\n
the file structure is very simple: each entry points to a repository of hooks, pins a specific rev
of that repository and lists the id
of the different hooks to run. The id
corresponds to an id
in this file: https://github.com/pre-commit/pre-commit-hooks/blob/master/.pre-commit-hooks.yaml
When we are done defining our .pre-commit-config.yaml
we just need to install it
pre-commit install\n
this will make sure that the file is automatically executed whenever we run git commit
Install pre-commit
pip install pre-commit\n
Next create the sample file
pre-commit sample-config > .pre-commit-config.yaml\n
The sample file already contains 4 hooks. Make sure you understand what each of them does and whether you need them at all.
pre-commit
works by hooking into the git commit
command, running whenever that command is run. For this to work, we need to install the hooks into git commit
. Run
pre-commit install\n
to do this.
Try to commit your recently created .pre-commit-config.yaml
file. It will likely not do anything, because pre-commit
only checks files that are being committed. Instead try to run
pre-commit run --all-files\n
that will check every file in your repository.
Try adding at least another check from the base repository to your .pre-commit-config.yaml
file.
If you have completed the optional module M7 on good coding practice you will have learned about the linter ruff
. ruff
comes with its own pre-commit hook. Try adding that to your .pre-commit-config.yaml
file and see what happens when you try to commit files.
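As a hedged sketch of what that could look like (the hook lives in the astral-sh/ruff-pre-commit repository; the rev tag below is an assumption, so pin whatever release is current), the extra entry in your .pre-commit-config.yaml could be:
- repo: https://github.com/astral-sh/ruff-pre-commit\n  rev: v0.4.4  # assumption: replace with the latest release tag\n  hooks:\n  - id: ruff\n    args: [--fix]\n  - id: ruff-format\n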
(Optional) Add more hooks to your .pre-commit-config.yaml
.
Sometimes you are in a hurry, so make sure that you can also do commits without running pre-commit
e.g.
git commit -m <message> --no-verify\n
Finally, figure out how to disable pre-commit
again (if you get tired of it).
That was all about how pre-commit
can be used to automate tasks. If you want to deep dive more into the topic you can check out this page on how to define your own pre-commit
hooks.
Core Module
What often comes to mind for many developers when discussing continuous integration (CI) is code testing. CI should ensure that whenever the codebase is updated it is automatically tested, such that if bugs have been introduced they are caught early on. If you look at the MLOps cycle, CI is one of the cornerstones of the operations part. However, it should be noted that applying CI does not magically ensure that your code does not break. CI is only as strong as the tests that are automatically executed; CI simply structures and automates their execution.
Quote
Continuous Integration doesn\u2019t get rid of bugs, but it does make them dramatically easier to find and remove. Martin Fowler, Chief Scientist, ThoughtWorks
Image credit
The kind of tests we are going to look at are called unit tests. Unit testing refers to the practice of writing tests that test individual parts of your codebase for correctness. By a unit, you can think of a function, a module or in general any object. By writing tests in this way it should be very easy to isolate which part of the code broke after an update to the codebase. Another way to test your codebase would be through integration testing, which is equally important but not something we are going to focus on in this course.
Unit tests (and integration tests) are not a concept unique to MLOps but a core concept of DevOps. However, it is important to note that testing machine learning-based systems is much more difficult than testing traditional systems. The reason for this is that machine learning systems depend on data, which influences the state of our system. For this reason, we not only need unit tests and integration tests of our code, we also need data testing, infrastructure testing and monitoring to check that we stay within the data distribution we are training on (more on this in module M25 on data drifting). This added complexity is illustrated in the figure below.
"},{"location":"s5_continuous_integration/unittesting/#pytest","title":"Pytest","text":"Before we can begin to automate testing of our code base we of course need to write the tests first. It is both a hard and tedious task to do but arguably the most important aspect of CI. Python offers a couple of different libraries for writing tests. We are going to use pytest
.
The following exercises should be applied to your MNIST repository
The first part of doing CI is writing the unit tests. We do not expect you to cover every part of the code you have developed but try to at least write tests that cover two files. Start by creating a tests
folder.
Read the getting started guide for pytest which is the testing framework that we are going to use
Install pytest:
pip install pytest\n
Write some tests. Below are some guidelines on some tests that should be implemented, but you are of course free to implement more tests. You can at any point check if your tests are passing by typing in a terminal
pytest tests/\n
When you implement a test you need to follow two standards for pytest
to be able to find your tests. First, any test files created (except __init__.py
) should always be named test_*.py
. Secondly, any test implemented needs to be wrapped in its own function that again needs to start with test_
:
# this will be found and executed by pytest\ndef test_something():\n ...\n\n# this will not be found and executed by pytest\ndef something_to_test():\n ...\n
Start by creating a tests/__init__.py
file and fill in the following:
import os\n_TEST_ROOT = os.path.dirname(__file__) # root of test folder\n_PROJECT_ROOT = os.path.dirname(_TEST_ROOT) # root of project\n_PATH_DATA = os.path.join(_PROJECT_ROOT, \"Data\") # root of data\n
these can help you refer to your data files during testing. For example, in another test file, I could write
from tests import _PATH_DATA\n
which then contains the root path to my data.
Data testing: In a file called tests/test_data.py
implement at least a test that checks that data gets correctly loaded. By this, we mean that you should check
def test_data():\n dataset = MNIST(...)\n assert len(dataset) == N_train for training and N_test for test\n assert that each datapoint has shape [1,28,28] or [784] depending on how you choose to format\n assert that all labels are represented\n
where N_train
should be either 30,000 or 50,000 depending on whether you are using just the first subset of the corrupted MNIST data or also including the second subset. N_test
should be 5000.
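A minimal sketch of what such a data test could look like is shown below. The import path, the corrupt_mnist loader and the exact dataset sizes are assumptions, so adapt them to how your own data code is organized:
# tests/test_data.py -- minimal sketch, import names and sizes are assumptions\nimport torch\n\nfrom my_project.data import corrupt_mnist  # hypothetical loader returning (train_set, test_set)\n\n\ndef test_data():\n    train_set, test_set = corrupt_mnist()\n    assert len(train_set) == 30000, 'Expected 30,000 training samples'\n    assert len(test_set) == 5000, 'Expected 5,000 test samples'\n    for dataset in (train_set, test_set):\n        x, y = dataset[0]\n        assert x.shape == (1, 28, 28), 'Expected each sample to have shape [1, 28, 28]'\n    train_targets = torch.tensor([int(y) for _, y in train_set])\n    assert torch.unique(train_targets).tolist() == list(range(10)), 'Expected all 10 classes to be represented'\n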
Model testing: In a file called tests/test_model.py
implement at least a test that checks for a given input with shape X that the output of the model has shape Y.
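Again as a hedged sketch, assuming a hypothetical MyModel class that outputs 10 logits per sample:
# tests/test_model.py -- minimal sketch, the model class and shapes are assumptions\nimport torch\n\nfrom my_project.model import MyModel  # hypothetical\n\n\ndef test_model_output_shape():\n    model = MyModel()\n    y = model(torch.randn(1, 1, 28, 28))\n    assert y.shape == (1, 10), 'Expected output of shape [1, 10] for a single MNIST image'\n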
Training testing: In a file called tests/test_training.py
implement at least one test that asserts something about your training script. You are here given free hands on what should be tested but try to test something that risks being broken when developing the code.
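One possible sketch, assuming your training code exposes a helper that runs a single epoch on a dataloader and returns the average loss (both import names below are hypothetical):
# tests/test_training.py -- minimal sketch, train_one_epoch is a hypothetical helper\nimport torch\n\nfrom my_project.model import MyModel  # hypothetical\nfrom my_project.train import train_one_epoch  # hypothetical: runs one epoch, returns average loss\n\n\ndef test_training_runs():\n    model = MyModel()\n    dataset = torch.utils.data.TensorDataset(torch.randn(16, 1, 28, 28), torch.randint(0, 10, (16,)))\n    loader = torch.utils.data.DataLoader(dataset, batch_size=8)\n    loss = train_one_epoch(model, loader)\n    assert torch.isfinite(torch.tensor(float(loss))), 'Training produced a non-finite loss'\n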
Good code raises errors and gives out warnings in appropriate places. This is often the case for some invalid combination of inputs to your script. For example, your model could check the size of the input given to it (see code below) to make sure it corresponds to what you are expecting. Not implementing such errors would still result in Pytorch failing at a later point due to shape errors; however, these custom errors will probably make more sense to the end user. Implement at least one raised error or warning somewhere in your code and use either pytest.raises
or pytest.warns
to check that they are correctly raised/warned. As inspiration, the following implements ValueError
in code belonging to the model:
# src/models/model.py\ndef forward(self, x: Tensor):\n    if x.ndim != 4:\n        raise ValueError('Expected input to be a 4D tensor')\n    if x.shape[1] != 1 or x.shape[2] != 28 or x.shape[3] != 28:\n        raise ValueError('Expected each sample to have shape [1, 28, 28]')\n
which would be captured by a test looking something like this:
# tests/test_model.py\ndef test_error_on_wrong_shape():\n    with pytest.raises(ValueError, match='Expected input to be a 4D tensor'):\n        model(torch.randn(1, 2, 3))\n
A test is only as good as the error message it gives, and by default, assert
will only report that the check failed. However, we can help ourselves and others by adding strings after assert
like
assert len(train_dataset) == N_train, \"Dataset did not have the correct number of samples\"\n
Add such comments to the assert statements you just did.
The tests that involve checking anything that has to do with our data will of course fail if the data is not present. To future-proof our code, we can take advantage of the pytest.mark.skipif
decorator. Use this decorator to skip your data tests if the corresponding data files do not exist. It should look something like this
import os.path\n\nimport pytest\n\n\n@pytest.mark.skipif(not os.path.exists(file_path), reason=\"Data files not found\")\ndef test_something_about_data():\n    ...\n
You can read more about skipping tests here
After writing the different tests, make sure that they are passing locally.
We often want to check a function/module for various input arguments. In this case, you could write the same test over and over again for the different input, but pytest
also has built-in support for this with the pytest.mark.parametrize decorator. Implement a parametrized test and make sure that it runs for different inputs. A hedged sketch is shown below.
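A sketch of such a parametrized test (MyModel and the 10-class output are again assumptions about your own code):
# runs once per batch size; MyModel is a hypothetical model class\nimport pytest\nimport torch\n\nfrom my_project.model import MyModel  # hypothetical\n\n\n@pytest.mark.parametrize('batch_size', [1, 16, 64])\ndef test_model_handles_different_batch_sizes(batch_size):\n    model = MyModel()\n    y = model(torch.randn(batch_size, 1, 28, 28))\n    assert y.shape == (batch_size, 10)\n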
There is no way of measuring how good the tests you have written are. However, what we can measure is code coverage. Code coverage refers to the percentage of your codebase that actually gets run when all your tests are executed. Having a high coverage at least means that all of your code actually runs when the tests are executed.
Install coverage
pip install coverage\n
Instead of running your tests directly with pytest
, now do
coverage run -m pytest tests/\n
To get a simple coverage report simply type
coverage report\n
which will give you the coverage percentage for each of your files. You can also write
coverage report -m\n
to get the exact lines that were missed by your tests.
Finally, try to increase the coverage by writing a new test that runs some of the lines in your codebase that are not covered yet.
Often coverage
reports the code coverage for files that we actually do not want a coverage score for. Figure out how to configure coverage
to exclude some files.
Assuming you have a code coverage of 100%, would you expect that no bugs are present in your code?
SolutionNo, code coverage is not a guarantee that your code is bug-free. It is just a measure of how many lines of code are run when your tests are executed. Therefore, there may still be some corner case that is not covered by your tests and will result in a bug. However, having a high code coverage is a good indicator that you have tested your code.
Consider the following code:
@pytest.mark.parametrize(\"network_size\", [10, 100, 1000])\n@pytest.mark.parametrize(\"device\", [\"cpu\", \"cuda\"])\nclass MyTestClass:\n @pytest.mark.parametrize(\"network_type\", [\"alexnet\", \"squeezenet\", \"vgg\", \"resnet\"])\n @pytest.mark.parametrize(\"precision\", [torch.half, torch.float, torch.double])\n def test_network1(self, network_size, device, network_type, precision):\n if device == \"cuda\" and not torch.cuda.is_available():\n pytest.skip(\"Test requires cuda\")\n model = MyModelClass(network_size, network_type).to(device=device, dtype=precision)\n ...\n\n @pytest.mark.parametrize(\"add_dropout\", [True, False])\n def test_network2(self, network_size, device, add_dropout):\n if device == \"cuda\" and not torch.cuda.is_available():\n pytest.skip(\"Test requires cuda\")\n model = MyModelClass2(network_size, add_dropout).to(device)\n ...\n
how many tests are executed when running the above code?
SolutionThe answer depends on whether or not we are running on a GPU-enabled machine. The test_network1
has 4 parameters, network_size, device, network_type, precision
, that respectively can take on 3, 2, 4, 3
values meaning that in total that test will be running 3x2x4x3=72
times with different parameters on a GPU-enabled machine and 36 on a machine without a GPU. A similar calculation can be done for test_network2
, which only has three factors network_size, device, add_dropout
that result in 3x2x2=12
tests on a GPU-enabled machine and 6 on a machine without a GPU. In total, that means 84 tests would run on a machine with a GPU and 42 on a machine without a GPU.
That covers the basics of writing unit tests for Python code. We want to note that pytest
of course is not the only framework for doing this. Python has a built-in framework called unittest for doing this also (but pytest
offers a few more features). Another open-source framework that you could check out is hypothesis, which can help catch errors in corner cases of your code. In addition to writing unit tests it is also highly recommended to test the code that you include in the docstrings of your functions and modules, to make sure that any code in your documentation is also correct. For such testing, we can highly recommend using Python's built-in framework doctest.
Slides
Running computations locally is often sufficient when you are only playing around with code in the initial phase of development. However, to really scale your experiments you will need more computing power than your standard laptop/desktop can offer. You probably already have experience with running on a local cluster or similar, but today's topic is about utilizing cloud computing.
Image credit
There exist numerous cloud compute providers, with some of the biggest being:
They all have slight advantages and disadvantages over each other. In this course we are going to focus on Google cloud, because they have been kind enough to sponsor $50 of cloud credit to each student. If you happen to run out of credit, you can also get some free credit for a limited amount of time when you sign up with a new account. What's important to note is that these different cloud providers all offer the same set of services, and that learning how to use the services of one cloud provider in many cases translates to knowing how to use the same services at another cloud provider. The services are called something different and can have a bit of a different interface/interaction pattern, but in the end it does not really matter.
Today's exercises are about getting to know how to work with the cloud. If you are in doubt about anything or want to deep dive into some topics, I can recommend watching this series of videos or going through the general docs.
Learning objectives
The learning objectives of this session are:
Core Module
Google Cloud Platform (GCP) is the cloud service provided by Google. The key concept, or selling point, of any cloud provider is the idea of near-infinite resources. Without the cloud it simply is not feasible to do many modern deep learning and machine learning tasks because they cannot be scaled locally.
The image below shows a subset of all the different services that the Google cloud platform offers. The ones marked in red are the ones we are actually going to investigate in this course. Therefore, if you get done with exercises early I highly recommend that you deep dive more into the Google cloud platform.
Image credit"},{"location":"s6_the_cloud/cloud_setup/#exercises","title":"\u2754 Exercises","text":"As the first step we are going to get you setup with some Google cloud credits.
Go to https://learn.inside.dtu.dk. Go to this course. Find the recent message where there should be a download link and instructions on how to claim the $50 cloud credit. Please do not share the link anywhere as there is a limited number of coupons. If you are not officially taking this course at DTU, Google gives $300 of cloud credit whenever you sign up with a new account. NOTE that you need to provide a credit card for this, so make sure to closely monitor your credit use so you do not end up spending more than the free credit.
Login to the homepage of gcp. It should look like this:
Go to billing and make sure that your account is showing $50 of cloud credit
make sure to also checkout the Reports
throughout the course. When you are starting to use some of the cloud services these tabs will update with info about how much time you can use before your cloud credit runs out. Make sure that you monitor this page as you will not be given another coupon.
One way to stay organized within GCP is to create projects.
Create a new project called dtumlops
. When you click create
you should get a notification that the project is being created. The notification bell is a good way to keep track of how the processes you are running are doing throughout the course.
For setup we are going to install gcloud
. gcloud
is the command line interface for working with our Google cloud account. Nearly everything that we can do through the web interface we can also do through the gcloud
interface. Follow the installation instructions here for your specific OS.
After installation, try in a terminal to type:
gcloud -h\n
the command should work and show the help page. If not, something went wrong in the installation (you may need to restart after installing).
Now login by typing
gcloud auth login\n
you should be sent to a web page where you link your cloud account to the gcloud
interface. Afterwards, also run this command:
gcloud auth application-default login\n
If you at some point want to revoke this you can type:
gcloud auth revoke\n
Next you will need to set the project that we just created. In your web browser under project info, you should be able to see the Project ID
belonging to your dtumlops
project. Copy this and type the following command in a terminal
gcloud config set project <project-id>\n
You can also get the project info by running
gcloud projects list\n
Next install the Google cloud python API:
pip install --upgrade google-api-python-client\n
Make sure that the python interface is also installed. In a python terminal type
import googleapiclient\n
this should work without any errors.
(Optional) If you are using VSCode you can also download the relevant extension called Cloud Code
. After installing it you should see a small Cloud Code
button in the action bar.
Finally, we need to activate a couple of developer APIs that are not activated by default. In a terminal write
gcloud services enable apigateway.googleapis.com\ngcloud services enable servicemanagement.googleapis.com\ngcloud services enable servicecontrol.googleapis.com\n
you can always check which services are enabled by typing
gcloud services list\n
After following these steps your laptop should hopefully be set up for using gcp
locally. You are now ready to use their services, both locally on your laptop and in the cloud console.
A big part of using the cloud in a bigger organisation has to do with admin and quotas. Admin here in general refers to the different roles that users of GCP can have, and quotas refer to the amount of resources that a given user has access to. For example one employee, let's say a data scientist, may only be granted access to certain GCP services that have to do with development and training of machine learning models, with X
amounts of GPUs available to use, to make sure that the employee does not spend too much money. Another employee, a devops engineer, probably does not need access to the same services and not necessarily the same resources.
In this course we are not going to focus too much on this aspect, but it is important to know that it exists. One feature you are going to need for doing the project is how to share a project with other people. This is done through the IAM (Identity and Access Management) page. Simply click the Grant Access
button, search for the email of the person you want to share the project with and give them either Viewer
, Editor
or Owner
access, depending on what you want them to be able to do. The figure below shows how to do this.
What we are going to go through right now is how to increase the quota for how many GPUs you have available in your project. By default, for free accounts in GCP (or accounts using teaching credits) the quota for GPUs that you can use is either 0 or 1 (their policies sometimes change). We will in the exercises below try to increase it.
"},{"location":"s6_the_cloud/cloud_setup/#exercises_1","title":"\u2754 Exercises","text":"Start by enabling the Compute Engine
service. Simply search for it in the top search bar. It should bring you to a page where you can enable the service (this may take some time). We are going to look more into this service in the next module.
Next go to the IAM & Admin
page, again search for it in the top search bar. The remaining steps are illustrated in the figure below.
Go to the quotas page
In the search field search for GPUs (all regions)
(needs to match exactly, the search field is case sensitive), such that you get the same quota as in the image.
In the limit column you can see what your current quota for the number of GPUs is. Additionally, to the right of the limit you can see the current usage. It is worth checking this if you are ever in doubt whether a job is running on a GPU or not.
Click the quota and afterwards the Edit quotas
button.
In the pop-up window, increase your limit to either 1 or 2.
After sending your request you can try clicking the Increase requests
tab to see the status of your request
If you ever run into errors when working with GPUs that contain statements about quotas
you can always try to go to this page to see what you are currently allowed to use and try to increase it. For example, when you get to training machine learning models using Vertex AI in the next module, you will most likely need to ask for a quota increase for that service as well.
Finally, we want to note that a quota increase is sometimes not allowed within 24 hours of creating an account. If your request gets rejected, we recommend waiting a day and trying again. If this still does not work, you may need to use their services some more to show that you are not a bot that wants to mine crypto.
"},{"location":"s6_the_cloud/cloud_setup/#knowledge-check","title":"\ud83e\udde0 Knowledge check","text":"What considerations to take when choosing an GCP region for running a new application?
SolutionA series of factors may influence your choice of region, including:
The 3 major cloud providers all have the same services, but they are called something different depending on the provider. What are the corresponding names of these GCP services in AWS and Azure?
It is important to know these correspondences to navigate blogpost etc. about MLOps on the internet.
Solution GCP AWS Azure Compute Engine Elastic Compute Cloud (EC2) Virtual Machines Cloud storage Simple Storage Service (S3) Blob Storage Cloud functions Lambda Functions Serverless Compute Cloud run App Runner, Fargate, Lambda Container Apps, Container Instances Cloud build CodeBuild DevOps Vertex AI SageMaker AI PlatformCore Module
In this set of exercises we are going to get more familiar with the using some of the resources that the Google cloud project offers.
"},{"location":"s6_the_cloud/using_the_cloud/#compute","title":"Compute","text":"The most basic service of any cloud provider is the ability to create and run virtual machines. In gcp
this service is called Compute Engine API. A virtual machine allows you to essentially run an operating system that behaves like a completely separate computer. There are many reasons why one to use virtual machines:
Virtual machines allow you to scale your operations, essentially giving you access to infinitely many individual computers
Virtual machines allow you to use large scale hardware. For example if you are developing an deep learning model on your laptop and want to know the inference time for a specific hardware configuration, you can just create a virtual machine with those specs and run your model.
Virtual machines allow you to run processes in the \"background\". If you want to train a model for a week or more, you do not want to do this on your own laptop as you cannot really move it or do anything with while it is training. Virtual machines allow you to just launch a job and forget about it (at least until you run out of credit).
We are now going to start actually using the cloud.
Click on the Compute Engine
tab in sidebar on the homepage of gcp
.
Try to Create instance
. You will see the following image below.
Give it a meaningful name, set the location to some location that is closer to where you actually is (to reduce latency). Finally try to adjust the the configuration a bit. What two factors are effecting the price of the compute unit?
After figuring this out, create a e2-medium
instance (leave rest configured as default). Before clicking the Create
button make sure to check the Equavalent Command Line
button. You should see a very long command that you could have typed instead to do the exact same.
Now in a local terminal type:
gcloud compute instances list\n
you should hopefully see the instance you have just created.
You can start a terminal directly by typing:
gcloud beta compute ssh --zone <zone> <name> --project <project-id>\n
You can always see the exact command that you need to run to ssh
to an VM by selecting the View gcloud command
option in the Compute Engine overview (see image below).
While logged into the instance, check if Python and Pytorch is installed? You should see that neither is installed. The VM we have only specified what compute resources it should have, and not what software should be in it. We can fix this by starting VMs based on specific docker images (its all coming together).
gcp
Comes with a number of ready-to-go images for doing deep learning. More info can be found here. Try, running this line:
gcloud container images list --repository=\"gcr.io/deeplearning-platform-release\"\n
what does the output show?
Next, start (in the terminal) a new instance using a Pytorch image. The command for doing it should look something like this:
gcloud compute instances create <instance_name> \\\n --zone=<zone> \\\n --image-family=<image-family> \\\n --image-project=deeplearning-platform-release \\\n # add these arguments if you want to run on GPU\n --accelerator=\"type=nvidia-tesla-K80,count=1\" \\\n --maintenance-policy TERMINATE \\\n --metadata=\"install-nvidia-driver=True\" \\\n
You can find more info here on what <image-family>
should have as value and what extra argument you need to add if you want to run on GPU (if you have access).
ssh
to the VM as in one of the previous exercises. Confirm that the container indeed contains both a Python installation and that Pytorch is also installed. Hint: you also have the possibility through the web page to start a browser session directly to the VMs you create:
Finally, everything that you have done locally can also be achieved through the web terminal, which of course comes pre-installed with the gcloud
command etc.
Try out launching this and run some of the commands from the previous exercises.
Stopping VMs
If you are not careful you can end up wasting a lot of credits on virtual machines that you are not using. VMs are charged by the minute, so even if you are not using them you are still paying for them. Therefore, it is important that you remember to stop your VMs when you are not using them. You can do this by either clicking the Stop
button in the VM overview page or by running the following command:
gcloud compute instances stop <instance-name>\n
"},{"location":"s6_the_cloud/using_the_cloud/#data-storage","title":"Data storage","text":"Another big part of cloud computing is storage of data. There are many reason that you want to store your data in the cloud including:
Cloud storage is luckily also very cheap. Google cloud only takes around $0.026 per GB per month. This means that around 1 TB of data would cost you $26 which is more than what the same amount of data would cost on Goggle Drive, but the storage in Google cloud is much more focused on enterprise where you have a need for accessing data through an API.
"},{"location":"s6_the_cloud/using_the_cloud/#exercises_1","title":"\u2754 Exercises","text":"When we did the exercise on data version control, we made dvc
work together with our own Google drive to storage data. However, a big limitation of this is that we need to authentic each time we try to either push or pull the data. The reason is that we need to use an API instead which is offered through gcp
.
We are going to follow the instructions from this page
Lets start by creating a data storage. On the GCP startpage, in the sidebar, click on the Cloud Storage
. On the next page click the Create bucket
:
Give the bucket an unique name, set it to a region close by and importantly remember to enable Object versioning under the last tab. Finally click Create
.
After creating the storage, you should be able to see it online and you should be able to see it if you type in your local terminal:
gsutil ls\n
gsutil
is an additional command to gcloud
, that provides more command line options.
Next we need the Google storage extension for dvc
pip install dvc[gs]\n
Now in your MNIST repository where you have already configured dvc, we are going to change the storage from our Google drive to our newly created Google cloud storage.
dvc remote add -d remote_storage <output-from-gsutils>\n
In addition we are also going to modify the remote to support object versioning (called version_aware
in dvc
):
dvc remote modify remote_storage version_aware true\n
This will change the default way that dvc
handles data. Instead of just storing the latest version of the data as content-addressable storage it will now store the data as it looks in our local repository, which allows us to not only use dvc
to download our data.
The above command will change the .dvc/config
file. git add
and git commit
the changes to that file. Finally, push data to the cloud
dvc push\n
Finally, make sure that you can pull without having to give your credentials. The easiest way to see this is to delete the .dvc/cache
folder that should be locally on your laptop and afterwards do a dvc pull
.
This setup should work when trying to access the data from your laptop, which we authenticated in the previous module. However, how can you access the data from a virtual machine, inside a docker container or from a different laptop? We in general recommend two ways:
You can make the bucket public accessible e.g. no authentication needed. That means that anyone with the url to the data can access it. This is the easiest way to do it, but also the least secure. You can read more about how to make your buckets public here.
You can create a service account which is a more secure way of accessing data. A service account is essentially a second user which you can give access to specific services. You can read more about how to create a service account here. Once you have created a service account you can give it access to a specific bucket by going to the Permissions
tab of the bucket and add the service account as a member.
If you need to authenticate your service account from a VM, you can do it by running the following command:
gcloud auth activate-service-account --key-file=<key-file>\n
where the <key-file>
is the json file that you downloaded when you created the service account (DO NOT SHARE THIS).
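If you prefer to access the bucket directly from Python code (for example inside a training container), a minimal sketch using the google-cloud-storage client could look like the following. The bucket and object names are placeholders, and the client simply uses whatever credentials are currently active (your user login, a service account key or the VM's default service account):
# pip install google-cloud-storage -- minimal sketch, bucket and object names are placeholders\nfrom google.cloud import storage\n\nclient = storage.Client()  # picks up the currently active credentials\nbucket = client.bucket('my-dtumlops-bucket')\nblob = bucket.blob('data/corruptmnist/train_images_0.pt')\nblob.download_to_filename('train_images_0.pt')  # download a single object to a local file\n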
You should hopefully at this point have seen the strength of using containers to create reproducible environments. They allow us to specify exactly the software that we want to run inside our VMs. However, you should already have run into two problems with containers: building them locally can take quite a long time, and the images quickly take up a lot of local disk space.
For this reason we want to move both the building process and the storage of images to the cloud. In GCP the service for this is called Artifact registry, formerly known as Container registry.
"},{"location":"s6_the_cloud/using_the_cloud/#exercises_2","title":"\u2754 Exercises","text":"For the purpose of these exercise I recommend that you start out with a dummy version of some code to make sure that the building process do not take too long. You are more than free to fork this repository. The repository contains a simple python script that does image classification using sklearn. The docker images for this application are therefore going to be substantially faster to build and smaller in size than the images we are used to that uses Pytorch.
Start by enabling the service: Google Artifact Registry API
and Google Cloud Build API
. This can be done through the web side (by searching for the services) or can also be enabled from the terminal:
gcloud services enable artifactregistry.googleapis.com\ngcloud services enable cloudbuild.googleapis.com\n
Google Cloud Build can in principle work out of the box with docker files. However, the recommended way is to add specialized cloudbuild.yaml
files. They should look something like this:
steps:\n - name: 'gcr.io/cloud-builders/docker'\n args: ['build', '-t', 'gcr.io/<project-id>/<image-name>', '.']\n - name: 'gcr.io/cloud-builders/docker'\n args: ['push', 'gcr.io/<project-id>/<image-name>']\n
which essentially is a basic yaml file that contains a list of steps, where each step consist of the service that should be used and the arguments for that service. In the above example we are calling the same service (cloud-builders/docker
) with different arguments (build
and then push
). Implement such a file in your repository. Hint: if you forked the repository then you at least need to change the <project-id>
.
From the gcp
homepage, navigate to the triggers panel:
Click on the manage repositories.
From there, click the Connect Repository
and go through the steps of authenticating your github profile with gcp
and choose the repository that you want to setup build triggers. For now, skip the Create a trigger (optional)
part by pressing Done
in the end.
Navigate back to the Triggers
homepage and click Create trigger
. Set the following:
Push to branch
^main$
Autodetected
or Cloud build configuration file
Finally click the Create
button and the trigger should show up on the triggers page.
To activate the trigger, push some code to the chosen repository.
Go to the Cloud Build
page and you should see the image being built and pushed.
Try clicking on the build to checkout the build process and building summary. As you can see from the image, if a build is failing you will often find valuable info by looking at the build summary. If you build is failing try to configure it to run in one of these regions: us-central1, us-west2, europe-west1, asia-east1, australia-southeast1, southamerica-east1
as specified in the documentation.
If/when your build is successful, navigate to the Artifact Registry
page. You should hopefully find that the image you just built was pushed here. Congrats!
Finally, try to pull your image down to your laptop
docker pull gcr.io/<project-id>/<image_name>:<image_tag>\n
you will need to authenticate docker
with gcp
first. Instructions can be found here, but the following command should hopefully be enough to make docker
and gcp
talk to each other:
gcloud auth configure-docker\n
Note: To do this you need to have docker
actively running in the background, as any other time you want to use docker
.
Automation through the cloud is in general the way to go, but sometimes you may want to manually create images and push them to the registry. Figure out how to push an image to your Container Registry
. For simplicity you can just push the busybox
image you downloaded during the initial docker exercises. This page should help you with exercise.
As our final step in our journey through different GCP services in this module we are going to look at training of our models. This is one of the important tasks that GCP can help us with, because we can always rent more hardware as long as we have credits, meaning that we can both scale horizontally (run more experiments) and vertically (run longer experiments).
We are going to check out two ways of running our experiments. First we are going to return to the Compute Engine service because it gives the simplest form of scaling of experiments. That is: we create a VM with an appropriate docker image, we start it, log in to the VM and run our experiments. It is possible for most people to run a couple of experiments in parallel this way. However, what if there was an abstraction layer that automatically created the VM for us, launched our experiments and then closed the VM afterwards?
This is where Vertex AI service comes into play. This is a dedicated service for handling ML models in GCP in the cloud. Vertex AI is in principal and end-to-end service that can take care of everything machine learning related in the cloud. In this course we are primarily focused on just the training of our models, and then use other services for different parts of our pipeline.
"},{"location":"s6_the_cloud/using_the_cloud/#exercises_3","title":"\u2754 Exercises","text":"Lets start by see how we could train a model using Pytorch using the Compute Engine service:
Start by creating a appropriate VM. If you want to start a VM that have Pytorch pre-installed with only CPU support you can run the following command
gcloud compute instances create <instance-name> \\\n --zone europe-west1-b \\\n --image-family=pytorch-latest-cpu \\\n --image-project=deeplearning-platform-release\n
alternatively, if you have access to GPU in your GCP account you could start a VM in the following way
gcloud compute instances create <instance-name> \\\n --zone europe-west4-a \\\n --image-family=pytorch-latest-gpu \\\n --image-project=deeplearning-platform-release \\\n --accelerator=\"type=nvidia-tesla-v100,count=1\" \\\n --metadata=\"install-nvidia-driver=True\" \\\n --maintenance-policy TERMINATE\n
Next login into your newly created VM. You can either open an ssh
terminal in the cloud console or run the following command
gcloud beta compute ssh <instance-name>\n
It is recommend to always check that the VM we get is actually what we asked for. In this case the VM should have Pytorch pre-installed so lets check for that by running
python -c \"import torch; print(torch.__version__)\"\n
Additionally, if you have a VM with GPU support also try running the nvidia-smi
command.
When you have logged in to the VM, it works as your own machine. Therefore to run some training code you would need to do the same setup step you have done on your own machine: clone your github, install dependencies, download data, run code. Try doing this to make sure you can train a model.
(Optional, may not work as intended) The last step in the previous exercise involves a lot of setup that would be necessary to do every time we create a new VM, making horizontal scaling of experiments cumbersome. However, we have already developed docker images that can take care of most of the setup.
Lets for simplicity just create a very small docker image (called gcp_vm_tester.dockerfile
) that you can use
FROM gcr.io/deeplearning-platform-release/pytorch-cpu\nRUN pip install matplotlib\n
this basically just extends the base Pytorch image to also install matplotlib. The important part about the docker images that we want to use here is that they should not have an ENTRYPOINT
at the end, because we do not want the docker container to actually run our scripts, just install dependencies on startup.
Lets build docker and manually push it to our container repository in gcp. Build with:
docker build -f gcp_vm_tester.dockerfile.dockerfile . -t gcp_vm_tester:latest\n
and then push with
docker tag tester gcr.io/<project-id>/gcp_vm_tester\ndocker push gcr.io/<project-id>/gcp_vm_tester\n
confirm by going to the container registry in the cloud console and check that the image has been correctly pushed.
Lets then create a VM with that particular docker image. Instead of using gcloud compute instances create
we are now using the gcloud compute instances create-with-container
command
gcloud compute instances create-with-container <instance-name> \\\n    --container-image=gcr.io/<project-id>/gcp_vm_tester \\\n    --zone europe-west1-b\n
Confirm that everything works by accessing your newly created VM and run both of these commands
python -c \"import torch; print(torch.__version__)\"\npython -c \"import matplotlib; print(matplotlib.__version__)\"\n
We are now moving on to the final way to train our code, using Vertex AI
service.
Start by enabling it by searching for Vertex AI
in the cloud console and go to the service
The way we are going to use Vertex AI is to create custom jobs because we have already developed docker containers that contains everything to run our code. Thus the only command that we actually need to use is gcloud ai custom-jobs create
command. An example here would be:
gcloud ai custom-jobs create \\\n --region=europe-west1 \\\n --display-name=test-run \\\n --config=config.yaml\n
Essentially, this command combines everything into one command: it first creates a VM with the specs specified by a configuration file, then loads a container specified again in the configuration file and finally it runs everything. A example of a config file could be:
# config_cpu.yaml\nworkerPoolSpecs:\n machineSpec:\n machineType: n1-highmem-2\n replicaCount: 1\n containerSpec:\n imageUri: gcr.io/<project-id>/<docker-img>\n
if you only want to run on CPU and another example for GPU:
# config_gpu.yaml\nworkerPoolSpecs:\n machineSpec:\n machineType: n1-standard-8\n acceleratorType: NVIDIA_TESLA_T4 #(1)!\n acceleratorCount: 1\n replicaCount: 1\n containerSpec:\n imageUri: gcr.io/<project-id>/<docker-img>\n
you can read more about the configuration formatting here and the different types of machines here. Try to execute a job using the gcloud ai custom-jobs create
command. For additional documentation you can checkout the documentation on the command and this page and this page
Assuming you manage to lunch a job, you should see an output like this:
To executing the commands that is outputted to look at both the status and the progress of your job.
In addition you can also visit the Custom Jobs
tab in training
part of Vertex AI
Check it out.
During custom training we do not necessarily need to use dvc
for downloading our data. A more efficient way is to use cloud storage as a mounted file system. This allows us to access data directly from the cloud storage without having to download it first. All our training jobs are automatically mounted a gcs
folder in the root directory. Try to access the data from your training script:
# loading from a bucket using mounted file system\ndata = torch.load('/gcs/<my-bucket-name>/data.pt')\n# writing to a bucket using mounted file system\ntorch.save(data, '/gcs/<my-bucket-name>/data.pt')\n
is should speed up the training process a bit.
This ends the session on how to use Google cloud services for now. In a future session we are going to investigate a bit more of the services offered in GCP, in particular for deploying the models that we have just trained.
"},{"location":"s7_deployment/","title":"08. Model deployment","text":"Slides
Lets say that you have spend 1000 GPU hours and trained the most awesome model that you want to share with the world. One way to do this is of course to just place all your code in a github repository, upload a file with the trained model weights to your favorite online storage (assuming it is too big for github to handle) and ask people to just download your code and the weights to run the code by themselves. This is a fine approach in a small research setting, but in production you need to be able to deploy the model to an environment that is fully contained such that people can just execute without looking (too hard) at the code.
Image credit
In this session we try to look at methods specialized towards deployment of models on your local machine and also how to deploy services in the cloud.
Learning objectives
The learning objectives of this session are:
fastapi
and run it locallyCore Module
Before we can get deployment of our models we need to understand concepts such as APIs and requests. The core reason for this is that we need a new abstraction layer on top of our applications that are not Python-specific. While Python is the defacto language for machine learning, we cannot expect everybody else to use it and in particular, we cannot expect network protocols (both locally and external) to be able to communicate with our Python programs out of the box. For this reason, we need to understand requests, in particular HTTP requests and how to create APIs that can interact with those requests.
"},{"location":"s7_deployment/apis/#requests","title":"Requests","text":"When we are talking about requests, we are essentially talking about the communication method used in client-server types of architectures. As shown in the image below, in this architecture, the client (user) is going to send requests to a server (our machine learning application) and the server will give a response. For example, the user may send a request to get the class of a specific image, which our application will do and then send back the response in terms of a label.
Image creditThe common way of sending requests is called HTTP (Hypertext Transfer Protocol). It is essentially a specification of the intermediary transportation method between the client and server. An HTTP request essentially consists of two parts:
The common request methods are (case sensitive):
You can read more about the different methods here. For most machine learning applications, GET and POST are the core methods to remember. Additionally, if you want to read more about HTTP in general we highly recommend that you go over this comic strip protocol, but the TLDR is that it provides privacy, integrity and identification over the web.
"},{"location":"s7_deployment/apis/#exercises","title":"\u2754 Exercises","text":"We are going to do a couple of exercises on sending requests using requests package to get familiar with the syntax.
Start by installing the `requests`` package
pip install requests\n
Afterwards, create a small script and try to execute the code
import requests\nresponse = requests.get('https://api.github.com/this-api-should-not-exist')\nprint(response.status_code)\n
As you can see from the syntax, we are sending a request using the GET method. This code should return status code 404. Take a look at this page that contains a list of status codes. Next, let's call a page that exists
import requests\nresponse = requests.get('https://api.github.com')\nprint(response.status_code)\n
What is the status code now and what does it mean? Status codes are important when you have an application that is interacting with a server and wants to make sure that it does not fail, which can be done with simple if
statements on the status codes
if response.status_code == 200:\n print('Success!')\nelif response.status_code == 404:\n print('Not Found.')\n
Next, try to call the following
response=requests.get(\"https://api.github.com/repos/SkafteNicki/dtu_mlops\")\n
which gives back a payload. Essentially, payload refers to any additional data that is sent from the client to the server or vice-versa. Try looking at the response.content
attribute. What is the type of this attribute?
You should hopefully observe that the .content
attribute is of type bytes
. It is important to note that this is the standard way of sending payloads to encode them into byte
objects. To get a more human-readable version of the response, we can convert it to JSON format
response.json()\n
Important to remember that a JSON object in Python is just a nested dictionary if you ever want to iterate over the object in some way.
When we use the GET method we can additionally provide a params
argument, that specifies what we want the server to send back for a specific request URL:
response = requests.get(\n 'https://api.github.com/search/repositories',\n params={'q': 'requests+language:python'},\n)\n
Before looking at reponse.json()
can you explain what the code does? You can try looking at this page for help.
Sometimes the content of a page cannot be converted into JSON, because as already stated data is sent as bytes. Say that we want to download an image, which we can do in the following way
import requests\nresponse = requests.get('https://imgs.xkcd.com/comics/making_progress.png')\n
Try calling response.json()
, what happens? Next, try calling response.content
. To get the result in this case we would need to convert from bytes to an image:
with open(r'img.png','wb') as f:\n f.write(response.content)\n
The get
method is the most useful method because it allows us to get data from the server. However, as stated in the beginning multiple request methods exist, for example, the POST method for sending data to the server. Try executing:
pload = {'username':'Olivia','password':'123'}\nresponse = requests.post('https://httpbin.org/post', data = pload)\n
Investigate the response (this is an artificial example because we do not control the server).
Finally, we should also know that requests can be sent directly from the command line using the curl
command. Sometimes it is easier to send a request directly from the terminal and sometimes it is easier to do it from a script.
Make sure you have curl
installed, or else find instructions on installing it. To check, call curl --help
which should print the documentation for curl.
To execute requests.get('https://api.github.com')
using curl we would simply do
curl -X GET \"https://api.github.com\"\ncurl -X GET -I \"https://api.github.com\" # if you want the status code\n
Try it yourself.
Try to redo some of the exercises yourself using curl
.
That ends the intro session on requests
. Do not worry if you are still not completely comfortable with sending requests, we are going to return to how we do it in practice when we have created our API. If you want to learn more about the requests
package you can check out this tutorial and if you want to see more examples of how to use curl
you can check out this page
Requests are all about being on the client side of our client-server architecture. We are now going to move on to the server side where we will be learning about writing the APIs that requests can interact with. An application programming interface (API) is essentially the way for the developer (you) tells a user how to use the application that you have created. The API is an abstraction layer that allows the user to interact with our application in the way we want them to interact with it, without the user even having to look at the code.
We can take the API from github as an example https://api.github.com. This API allows any user to retrieve, integrate and send data to Github without ever having to visit their webpage. The API exposes multiple endpoints that have various functions:
and we could go on. However, there may be functionality that Github is not interested in users having access to and they may therefore choose not to have endpoints for specific features (1).
The particular kind of API we are going to work with is called REST API (or RESTful API). The REST API specify specific constraints that a particular API needs to fulfill to be considered RESTful. You can read more about what the six guiding principles behind REST API on this page but one of the most important to have in mind is that the client-server architecture needs to be stateless. This means that whenever a request is send to the server it needs to be self-contained (all information included) and the server cannot rely on any previously stored information from previous requests.
To implement APIs in practise we are going to use FastAPI. FastAPI is a modern, fast (high-performance), web framework for building APIs with Python 3.6+ based on standard Python type hints. FastAPI is only one of many frameworks for defining APIs, however, compared to other frameworks such as Flask and django it offers a sweet spot of being flexible enough to do what you want without having many additional (unnecessary) features.
"},{"location":"s7_deployment/apis/#exercises_1","title":"\u2754 Exercises","text":"The exercises below are a condensed version of this and this tutorial. If you ever need context for the exercises, we can recommend trying to go through these. Additionally, we also provide this solution file that you can look through for help.
Install FastAPI
pip install fastapi\n
This contains the functions, modules, and variables we are going to need to define our interface.
Additionally, also install uvicorn
which is a package for defining low level server applications.
pip install uvicorn[standard]\n
Start by defining a small application like this in a file called main.py
:
from fastapi import FastAPI\napp = FastAPI()\n\n@app.get(\"/\")\ndef read_root():\n return {\"Hello\": \"World\"}\n\n@app.get(\"/items/{item_id}\")\ndef read_item(item_id: int):\n return {\"item_id\": item_id}\n
Importantly here is the use of the @app.get
decorator. What could this decorator refer to? Explain what the two functions are probably doing.
Next lets launch our app. Since we called our script main.py
and we inside the script initialized our API with app = FastAPI
, our application that we want to deploy can be referenced by main:app
:
uvicorn --reload --port 8000 main:app\n
this will launch a server at this page: http://localhost:8000/
. As you will hopefully see, this page will return the content of the root
function, like the image below. Remember to also check the output in your terminal as that will give info on when and how your application is being invoked.
What webpage should you open to get the server to return 1
?
Also checkout the pages: http://localhost:8000/docs
and http://localhost:8000/redoc
. What does these pages show?
The power of the docs
and redoc
pages is that they allow you to easily test your application with their simple UI. As shown in the image below, simply open the endpoint you want to test, click the Try it out
button, input any values and execute it. It will return both the corresponding curl
command for invoking your endpoint, the corresponding URL and response of you application. Try it out.
You can also checkout http://localhost:8000/openapi.json
to check out the schema that is generated which essentially is a json
file containing the overall specifications of your program.
Try to access http://localhost:8000/items/foo
, what happens in this case? When you specify types in your API, FastAPI will automatically do type validation using pydantic, making sure users can only access your API with the correct types. Therefore, remember to include types in your applications!
With the fundamentals in place let's configure it a bit more:
Lets start by changing the root function to include a bit more info. In particular we are also interested in returning the status code so the end user can easily read that. Default status codes are included in the http built-in python package:
from http import HTTPStatus\n\n@app.get(\"/\")\ndef root():\n \"\"\" Health check.\"\"\"\n response = {\n \"message\": HTTPStatus.OK.phrase,\n \"status-code\": HTTPStatus.OK,\n }\n return response\n
try to reload the app and see what is returned now. You should not have to re-launch the app because we initialized the app with the --reload
argument.
When we decorate our functions with @app.get(\"/items/{item_id}\")
, item_id
is in the case what we call a path parameters because it is a parameter that is directly included in the path of our endpoint. We have already seen how we can restrict a path to a single type, but what if we want to restrict it to specific values? This is often the case if we are working with parameters of type str
. In this case we would need to define a enum
:
from enum import Enum\nclass ItemEnum(Enum):\n alexnet = \"alexnet\"\n resnet = \"resnet\"\n lenet = \"lenet\"\n\n@app.get(\"/restric_items/{item_id}\")\ndef read_item(item_id: ItemEnum):\n return {\"item_id\": item_id}\n
Add this API, reload and execute both a valid parameter and a non-valid parameter.
In contrast to path parameters we have query parameters. In the requests exercises we saw an example of this where we were calling https://api.github.com/search/code with the query 'q': 'requests+language:python'
. Any parameter in FastAPI that is not a path parameter, will be considered a query parameter:
@app.get(\"/query_items\")\ndef read_item(item_id: int):\n return {\"item_id\": item_id}\n
Add this API, reload and figure out how to pass in a query parameter.
We have until now worked with the .get
method, but lets also see an example of the .post
method. As already described the POST request method is used for uploading data to the server. Here is a simple app that saves username and password in a database (please never implement this in real life like this):
database = {'username': [ ], 'password': [ ]}\n\n@app.post(\"/login/\")\ndef login(username: str, password: str):\n username_db = database['username']\n password_db = database['password']\n if username not in username_db and password not in password_db:\n with open('database.csv', \"a\") as file:\n file.write(f\"{username}, {password} \\n\")\n username_db.append(username)\n password_db.append(password)\n return \"login saved\"\n
Make sure you understand what the function does and then try to execute it a couple of times to see your database updating. It is important to note that we sometimes in the following exercises use the .get
method and sometimes the .post
method. For our usage it does not really matter.
We are now moving on to figuring out how to provide different standard inputs like text, images and json to our APIs. It is important that you try out each example yourself and in particular that you look at the curl
commands that are necessary to invoke each application.
Here is a small application that takes a single text input:
@app.get(\"/text_model/\")\ndef contains_email(data: str):\n regex = r'\\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\\.[A-Z|a-z]{2,}\\b'\n response = {\n \"input\": data,\n \"message\": HTTPStatus.OK.phrase,\n \"status-code\": HTTPStatus.OK,\n \"is_email\": re.fullmatch(regex, data) is not None\n }\n return response\n
What does the application do? Try it out yourself
Let's say we wanted to extend the application to check for a specific email domain, either gmail
or hotmail
. Assume that we want to feed this into our application as a json
object e.g.
{\n \"email\": \"mlops@gmail.com\",\n \"domain_match\": \"gmail\"\n}\n
Figure out how to alter the data
parameter such that it takes in the json
object and make sure to extend the application to check if the email and domain also match. Hint: take a look at this page
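As a hint, one possible shape of the solution (a sketch only; the names DomainEnum and EmailDomainCheck are our own, and we use a POST endpoint to carry the json body) is to define a pydantic model and use it as the request body:
from enum import Enum\nfrom http import HTTPStatus\nimport re\n\nfrom pydantic import BaseModel\n\nclass DomainEnum(Enum):\n    gmail = \"gmail\"\n    hotmail = \"hotmail\"\n\nclass EmailDomainCheck(BaseModel):\n    email: str\n    domain_match: DomainEnum\n\n@app.post(\"/text_model/\")\ndef contains_email_domain(data: EmailDomainCheck):\n    regex = r'\\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\\.[A-Z|a-z]{2,}\\b'\n    domain = data.email.split(\"@\")[-1].split(\".\")[0]  # e.g. \"gmail\" for mlops@gmail.com\n    response = {\n        \"is_email\": re.fullmatch(regex, data.email) is not None,\n        \"domain_match\": domain == data.domain_match.value,\n        \"message\": HTTPStatus.OK.phrase,\n        \"status-code\": HTTPStatus.OK,\n    }\n    return response\n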
Let's move on to an application that requires a file input:
from fastapi import UploadFile, File\nfrom typing import Optional\n\n@app.post(\"/cv_model/\")\nasync def cv_model(data: UploadFile = File(...)):\n with open('image.jpg', 'wb') as image:\n content = await data.read()\n image.write(content)\n image.close()\n\n response = {\n \"input\": data,\n \"message\": HTTPStatus.OK.phrase,\n \"status-code\": HTTPStatus.OK,\n }\n return response\n
A couple of new things are going on here: we use the specialized UploadFile
and File
bodies in our input definition. Additionally, we added the async
/await
keywords. Figure out what everything does and try to run the application (you can use any image file you like).
The above application does not actually do anything with the image. Let's add opencv as a package and resize the image. It can be done with the following three lines:
import cv2\nimg = cv2.imread(\"image.jpg\")\nres = cv2.resize(img, (h, w))\n
Figure out where to add them in the application and additionally add h
and w
as optional parameters, with a default value of 28. Try running the application where you specify everything and one more time where you leave out h
and w
.
Finally, let's also figure out how to return a file from our application. You will need to add the following lines:
from fastapi.responses import FileResponse\ncv2.imwrite('image_resize.jpg', res)\nFileResponse('image_resize.jpg')\n
Figure out where to add them to the code and try running the application one more time to see that you get a file back with the resized image.
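If you are stuck, here is one way the pieces could fit together (a sketch under the assumptions above, not the only possible solution):
import cv2\nfrom fastapi import UploadFile, File\nfrom fastapi.responses import FileResponse\n\n@app.post(\"/cv_model/\")\nasync def cv_model(data: UploadFile = File(...), h: int = 28, w: int = 28):\n    # save the uploaded file to disk\n    with open('image.jpg', 'wb') as image:\n        content = await data.read()\n        image.write(content)\n\n    # resize with opencv and write the result to a new file\n    img = cv2.imread(\"image.jpg\")\n    res = cv2.resize(img, (h, w))\n    cv2.imwrite('image_resize.jpg', res)\n\n    # return the resized image to the caller\n    return FileResponse('image_resize.jpg')\n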
(Optional) Let's try to figure out how to use FastAPI in a machine learning context. Below is a script that downloads a VisionEncoderDecoder
from huggingface. The model can be used to create captions for a given image. Thus calling
predict_step(['s7_deployment/exercise_files/my_cat.jpg'])\n
returns a list of strings like ['a cat laying on a couch with a stuffed animal']
(try this yourself). Create a FastAPI application that can do inference using this model, i.e. it should take in an image and preferably an optional json
object for configuring some of the hyperparameters (like max_length
) and should return a string containing the generated caption.
from transformers import VisionEncoderDecoderModel, ViTFeatureExtractor, AutoTokenizer\nimport torch\nfrom PIL import Image\n\nmodel = VisionEncoderDecoderModel.from_pretrained(\"nlpconnect/vit-gpt2-image-captioning\")\nfeature_extractor = ViTFeatureExtractor.from_pretrained(\"nlpconnect/vit-gpt2-image-captioning\")\ntokenizer = AutoTokenizer.from_pretrained(\"nlpconnect/vit-gpt2-image-captioning\")\n\ndevice = torch.device(\"cuda\" if torch.cuda.is_available() else \"cpu\")\nmodel.to(device)\n\ngen_kwargs = {\"max_length\": 16, \"num_beams\": 8, \"num_return_sequences\": 1}\ndef predict_step(image_paths):\n images = []\n for image_path in image_paths:\n i_image = Image.open(image_path)\n if i_image.mode != \"RGB\":\n i_image = i_image.convert(mode=\"RGB\")\n\n images.append(i_image)\n pixel_values = feature_extractor(images=images, return_tensors=\"pt\").pixel_values\n pixel_values = pixel_values.to(device)\n output_ids = model.generate(pixel_values, **gen_kwargs)\n preds = tokenizer.batch_decode(output_ids, skip_special_tokens=True)\n preds = [pred.strip() for pred in preds]\n return preds\n
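As a starting hint, a minimal endpoint wrapping predict_step (assuming it lives in the same file as the snippet above) could look something like this:
from fastapi import FastAPI, UploadFile, File\n\napp = FastAPI()\n\n@app.post(\"/caption/\")\nasync def caption(data: UploadFile = File(...), max_length: int = 16):\n    # save the uploaded image so predict_step can read it from disk\n    with open(\"image.jpg\", \"wb\") as image:\n        image.write(await data.read())\n    gen_kwargs[\"max_length\"] = max_length  # hyperparameter passed through to generation\n    return {\"caption\": predict_step([\"image.jpg\"])[0]}\n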
As the final step, we want to figure out how to include our FastAPI application in a docker container, as it will help us when we want to deploy in the cloud, because docker, as always, can take care of the dependencies of our application. For the following set of exercises you can use whichever of the previous FastAPI applications you like as the base application for the container.
Start by creating a requirements.txt
file for your application. You will at least need fastapi
and uvicorn
in the file, and we always recommend that you are specific about the versions you want to use:
fastapi>=0.68.0,<0.69.0\nuvicorn>=0.15.0,<0.16.0\n# add anything else your application needs to be able to run\n
Next, create a Dockerfile
with the following content
FROM python:3.9\nWORKDIR /code\nCOPY ./requirements.txt /code/requirements.txt\n\nRUN pip install --no-cache-dir --upgrade -r /code/requirements.txt\nCOPY ./app /code/app\n\nCMD [\"uvicorn\", \"app.main:app\", \"--host\", \"0.0.0.0\", \"--port\", \"80\"]\n
The above assumes that your file structure looks like this
.\n\u251c\u2500\u2500 app\n\u2502 \u251c\u2500\u2500 __init__.py\n\u2502 \u2514\u2500\u2500 main.py\n\u251c\u2500\u2500 Dockerfile\n\u2514\u2500\u2500 requirements.txt\n
Hopefully, all these steps should look familiar if you already went through module M9, except for maybe the last line. However, this is just the standard way that we have run our FastAPI applications in the last couple of exercises, this time with some extra arguments regarding the ports we allow.
Next, build the corresponding docker image
docker build -t my_fastapi_app .\n
Finally, run the image such that a container is spun up that runs our application. The important part here is to remember to specify the -p
argument (p for port) which should be the same number as the port we have specified in the last line of our Dockerfile.
docker run --name mycontainer -p 80:80 my_fastapi_app\n
Check that everything is working by going to the corresponding localhost page http://localhost/items/5?q=somequery
(Optional) In module M15 on unit testing you learned how to write unit tests for your data pipeline and model. It should come as no surprise that the same can also be done for your API. Doing so should tell you whether your API is working as you expect it to. The only complication regarding APIs is that you need a server to test against, and we cannot use uvicorn
for this. Check out this page on how to test FastAPI
applications, and add a file called test_api.py
to your tests
folder with appropriate tests for your API.
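As a hint, FastAPI ships with a test client that runs your app without starting a server; a minimal test (assuming your app is importable as app from app.main and has the health-check root endpoint from earlier) could look like this:
from fastapi.testclient import TestClient\n\nfrom app.main import app  # adjust the import to wherever your app is defined\n\nclient = TestClient(app)\n\ndef test_read_root():\n    response = client.get(\"/\")\n    assert response.status_code == 200\n    assert response.json()[\"message\"] == \"OK\"  # HTTPStatus.OK.phrase\n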
This ends the module on APIs. If you want to go further in this direction we highly recommend that you check out bentoml, which is an API standard that focuses solely on creating easy-to-understand APIs and services for ML applications. Additionally, we can also highly recommend checking out Postman, which can help design, document and in particular test the API you are writing to make sure that it works as expected.
"},{"location":"s7_deployment/cloud_deployment/","title":"M24 - Cloud Deployment","text":""},{"location":"s7_deployment/cloud_deployment/#cloud-deployment","title":"Cloud deployment","text":"Core Module
We are now returning to using the cloud. By this point you should have gone through the steps of having the code in your github repository automatically build into a docker container, storing that container, storing data and pulling it all together to make a training run. After the training is completed you should hopefully have a file stored in the cloud with your trained model weights.
Today's exercises will be about serving those model weights to an end user. We focus on two different ways of deploying our model, Google cloud functions
and Google Vertex AI endpoints
.
Cloud functions are the easiest way to get started with deployment because they are what is called serverless. For serverless deployment we still need a server to do the actual workload, however the core concept is that you do not have to manage the server. Everything is magically taken care of behind the scenes.
"},{"location":"s7_deployment/cloud_deployment/#exercises","title":"\u2754 Exercises","text":"Go to the start page of Cloud Functions
. It can be found in the sidebar on the homepage or you can just search for it. Activate the service if it is not already active.
Click the Create Function
button which should take you to a screen like the image below. Give it a name, set the server region to somewhere close by and change the authentication policy to Allow unauthenticated invocations
so we can access it directly from a browser. Remember to note down the URL of the service somewhere.
On the next page, for Runtime
pick the Python 3.9
option. This will make the inline editor show both a main.py
and requirements.txt
file. Look over them. Click the Deploy
button in the lower left corner.
Afterwards you should see a green check mark beside your function meaning that it is deployed. Click the Test function
button which will take you to the testing page.
If you know what the application does, it should come as no surprise that it can run without any input. We therefore just send an empty request by clicking the Test The Function
button. Does the function return the output you expected? Wait for the logs to show up. What do they show?
What should the Triggering event
look like in the testing prompt for the program to respond with
Good day to you sir!\n
Try it out.
Click on the metrics tab. Identify what each panel is showing.
Go to the trigger tab and go to the url for the application.
Check out the logs tab. You should see that your application has already been invoked multiple times. Also try to execute this command in a terminal:
gcloud functions logs read\n
Next, we are going to create an application that actually takes some input so we can try to send it requests. We provide a very simple sklearn_cloud_function.py script to get started.
Figure out what the script does and run it. This should create a file with a trained model.
Next create a storage bucket and upload the model file to the bucket. You can either do this through the webpage or run the following commands:
gsutil mb gs://<bucket-name> # mb stands for make bucket\ngsutil cp <file-name> gs://<bucket-name> # cp stands for copy\n
check that the file is in the bucket.
Create a new cloud function with the same initial settings as the first one. Also choose the Python 3.9
option, but this time change the code to something that can actually use the model we just uploaded. Here is a code snippet to help you:
from google.cloud import storage\nimport pickle\n\nBUCKET_NAME = ...\nMODEL_FILE = ...\n\nclient = storage.Client()\nbucket = client.get_bucket(BUCKET_NAME)\nblob = bucket.get_blob(MODEL_FILE)\nmy_model = pickle.loads(blob.download_as_string())\n\ndef knn_classifier(request):\n \"\"\"Will do stuff to your request.\"\"\"\n request_json = request.get_json()\n if request_json and 'input_data' in request_json:\n data = request_json['input_data']\n input_data = list(map(int, data.split(',')))\n prediction = my_model.predict([input_data])\n return f'Belongs to class: {prediction}'\n else:\n return 'No input data received'\n
Some notes: * For locally testing the above code you will need to install the google-cloud-storage
python package * Remember to change the Entry point
* Remember to also fill out the requirements.txt
file. You need at least two packages to run the application with google-cloud-storage
being one of them. * If your deployment fails, try to go to the Logs Explorer
page in gcp
which can help you identify why.
When you have successfully deployed the model, try to make predictions with it.
You can finally try to redo the exercises by deploying a Pytorch application. You will essentially need to go through the same steps as the sklearn example, including uploading a trained model to a storage bucket, writing a cloud function that loads it and returning some output. You are free to choose whatever Pytorch model you want.
Cloud functions are great for simple deployments that can be encapsulated in a single script with only simple requirements. However, they do not really scale to more advanced applications that may depend on multiple programming languages. We are already familiar with how we can deal with this through containers, and Cloud Run is the corresponding service in GCP for deploying containers.
"},{"location":"s7_deployment/cloud_deployment/#exercises_1","title":"\u2754 Exercises","text":"We are going to start locally by developing a small app that we can deploy. We provide two small examples to choose from: first a small FastAPI app consisting of this .py file and this dockerfile . Secondly a small streamlit application consisting of just this dockerfile . You are free to choose which application to work with.
Start by going over the files belonging to your chosen app and understand what it does.
Next build the docker image belonging to the app
docker build -f <dockerfile> . -t gcp_test_app:latest\n
Next tag and push the image to your container registry
docker tag gcp_test_app gcr.io/<project-id>/gcp_test_app\ndocker push gcr.io/<project-id>/gcp_test_app\n
Afterwards, check your container registry to confirm that you have successfully pushed the image.
Next go to Cloud Run
in the cloud console and enable the service.
Click the Create Service
button which should bring you to a page similar to the one below
Do the following: * Click the select button, which will bring up all built containers, and pick the one you want to deploy. In the future you probably want to choose the Continuously deploy new revision from a source repository option such that a new version is always deployed when a new container is built. * Hereafter, give the service a name and select the region. We recommend choosing a region close to you, however it does not really matter that much for our use case. * Set the authentication method to Allow unauthenticated invocations such that we can call it without providing credentials. In the future you may want to only allow authenticated invocations. * Expand the Container, Connections, Security tab and edit the port such that it matches the port exposed in your chosen application.
Finally, click the create button and wait for the service to be deployed (may take some time).
If you manage to deploy the service you should see an image like this:
You can now access your application by clicking the url. This will access the root of your application, so you may need to add /
or /<path>
to the url depending on how the app works.
Everything we just did to deploy a container can be reproduced using the following command:
gcloud run deploy $APP --image $TAG --platform managed --region $REGION --allow-unauthenticated\n
and checked using these two commands
gcloud run services list\ngcloud run services describe $APP --region $REGION\n
Feel free to experiment with doing the deployment from the command line.
Instead of deploying our docker container using the UI or command line, which is a manual operation, we can do it in a continuous manner by using the cloudbuild.yaml
file we learned about in the previous section. We just need to add a new step to the file. We provide an example:
steps:\n# Build the container image\n- name: 'gcr.io/cloud-builders/docker'\n args: ['build', '-t', 'gcr.io/$PROJECT_ID/<container-name>:latest', '.'] #(1)!\n# Push the container image to Container Registry\n- name: 'gcr.io/cloud-builders/docker'\n args: ['push', 'gcr.io/$PROJECT_ID/<container-name>:latest']\n# Deploy container image to Cloud Run\n- name: 'gcr.io/google.com/cloudsdktool/cloud-sdk'\n entrypoint: gcloud\n args:\n - 'run'\n - 'deploy'\n - '<service-name>'\n - '--image'\n - 'gcr.io/$PROJECT_ID/<container-name>:latest'\n - '--region'\n - '<region>'\n
This line assumes you are standing in the root of your repository and tries to build the docker image specified in a file called Dockerfile
and tag it with the name gcr.io/$PROJECT_ID/my_deployment:latest
. Therefore, if you want to point to another dockerfile you need to add the -f
option to the command. For example if you want to point to a my_app/my_serving_app.dockerfile
you need to change the line to
args: ['build', '-f', 'my_app/my_serving_app.dockerfile', '-t', 'gcr.io/$PROJECT_ID/my_deployment:latest', '.']\n
where you need to replace <container-name>
with the name of your container, <service-name>
with the name of the service you want to deploy and <region>
with the region you want to deploy to. Afterwards you need to set up a trigger (or reuse the one you already have) to build the container and deploy it to cloud run. Confirm that this works by making a change to your application, pushing it to github and seeing if the application is updated continuously. If you need help you can look here. If you succeeded, congratulations: you have now set up a continuous deployment pipeline.
That ends the exercises on deployment. The exercises above are just a small taste of what deployment has to offer. In both sections we have explicitly chosen to work with serverless deployments. But what if you wanted to do the opposite, e.g. be the one in charge of managing the cluster that handles the deployed services? If you are really interested in taking deployment to the next level you should get started with kubernetes, which is the de-facto open-source container orchestration platform used in production environments. If you want to deep dive we recommend starting here, which describes how to make pipelines that are a necessary component before you start to create your own kubernetes cluster.
"},{"location":"s7_deployment/local_deployment/","title":"M23 - Local Deployment","text":""},{"location":"s7_deployment/local_deployment/#local-deployment","title":"Local Deployment","text":"Regardless of your application, model and usecase, the first starting point of serving your model should always be to deploy it locally. The simple reason for that is debugging: if you deploy directly to the cloud you often get less verbose error message and/or the iteration time is much slower because it simply takes much longer time to deploy to the cloud than locally. Locally should therefore always be the first step with any new application.
For this module we are going to focus on deployment of deep learning models, in particular Pytorch models, which are used throughout the course. Pytorch has historically been developed for research purposes, where quickly iterating on ideas was valued over fast computations. This is evident since Pytorch uses a dynamic graph underneath to represent the computational graph that is being created whenever you are running calculations. The graph is important, as it keeps track of how to do backpropagation through your Pytorch application. However, running code dynamically is notoriously slower than compiling your code before running it. Let's therefore first consider another way of compiling our code.
"},{"location":"s7_deployment/local_deployment/#compilation","title":"Compilation","text":"If you ever coded in any low-level language such as c, fortran or c++ you should be familiar with the term compiling. Compiling is the task of taken a computer program written in one language and translating it into another. In most cases this means taken whatever you have written in your preferred programming language and translating it into machine code that the computer can execute. But what does compilation have to do with coding Pytorch models?
It happens to be that Pytorch
comes with its own compiler that can optimize your model for you. It can be found in the submodule torch.jit
. Jit stands for just-in-time, meaning that compilation runs at the same time we are executing the code. If you know anything about low-level languages such c/c++ you know that we normally compile the code before we run it. With jit
we essentially merges the two phases into one. jit
has two types of compilation modes, called respective script and trace. We are in the exercises going to look at script as it is the easiest to get started with and works without any code changes for nearly all kind of models. If you ever encounter that script does not work for you then trace can be used which is more general.
The major reasons why we want to compile our models with torch.jit
are:
We are here going to look at torch.jit.script
for compiling our code.
To see the difference in the this exercises, we start out with a large model. Download one of the large image classification models from torchvision
such as ResNet-152
. For the purpose of the exercise it does not matter if you work with a random initialized model or a pretrained version.
Next try to script the model using torch.jit.script
. You can find the documentation here.
Just to confirm that by compiling our model using torch.jit.script
did not change the output of our model, try checking that the output of the scripted model corresponds to the output of the non-scripted model. You can do this on a single random datapoint, and you should check that the top-5 prediced classes are the same
assert torch.allclose(unscripted_top5_indices, scripted_top5_indices)\n
Hint: use torch.topk.
Finally, try benchmarking the non-scripted model against the scripted model. I recommend using the built-in benchmarker in Pytorch: torch.utils.benchmark.Timer
, which you can read more about how to use here. Do you see a increase in performance of the scripted model compared to the non-scriptet model. If so, what is the percentage increase in efficiency?
For locally deploying our model we are going to look at Torchserve. Torchserve (illustrated below) is a combined services for packaging and serving multiple Pytorch at the same time.
Image creditBefore we go into details of Torchmetrics, an important question is why we need such an abstraction on top of our developed model. Why can't we just do:
python inference.py --my_model model_checkpoint.pt --new_datapoint img.png\n
If we where never going to do anything else than just calling the model ourself then it is probably not worth adding anything else. However, if we ever want anyone else to interact with our model, we need to comply with standard ways of requesting and sending data. This is especially true when the next step is to start deploying our model in the cloud. Torchserve essentially brings in a inference API on top of our model that turns our model into a client-server type of system: the client (user) is going to send requests to a server (our application) and the server will give an response. The request will be send as a standard HTTP requests which Torchserve will help us decode into a useful input which we can then do inference on and return the result, again as an standardized HTTP response. Torchserve is in that regard similar to FastAPI or Flask if you have ever used one of those frameworks.
Finally, the packaging part of Torchserve is necessary because we cannot give a Torchserve a raw file of trained model weights as these essentially is just a list of floats. We need a file that both contains the model definition and the trained weights, such that the model essentially becomes independent of the python interpreter.
"},{"location":"s7_deployment/local_deployment/#exercises_1","title":"\u2754 Exercises","text":"Torchserve can be a bit rough around the edges but is fairly easy to work with. We are largely going to follow the instructions listed in the readme file for Torchserve. The intention in these exercises is to serve a Resnet type neural network that is trained for classification on ImageNet. Additional documentation can be found here.
Install torchserve
and its dependencies. There are separate instructions on the homepage depending on you are using Windows, WSL or Linux/MAC.
Create a folder called model_store
. This is where we will store the model that we are going to deploy
Try to run the torchserve --model-store model_store
command. If the service starts with no errors, you have installed it correctly and can continue the exercise. Else it is Googling time!
Next lets create a model we can serve. If you have done the previous exercises on compiling using scripting, we highly recommend to initialize and save such model
model = ResnetFromTorchVision(pretrained=True)\nscript_model = torch.jit.script(model)\nscript_model.save('deployable_model.pt')\n
Call the model archiver. We have provided a file called index_to_name.json
that maps from predicted class indices to interpretable class name e.g. 1->\"goldfish\"
. This file should be provided as the extra-files
argument such that the deployed model automatically outputs the class name. Note that this files of course only works for models trained on imagenet.
torch-model-archiver \\\n --model-name my_fancy_model\n --version 1.0 \\\n --serialized-file path/to/serialized_model.pt \\\n --export-path model_store\n --extra-files index_to_name.json\n --handler image_classifier\n
Checkout the model_store
folder. Has the model archiver correctly created a model (with .mar
extension) inside the folder?
Finally, we are going to deploy our model and use it:
Start serving your model in one terminal:
torchserve --start --ncs --model-store model_store --models my_fancy_model=my_fancy_model.mar\n
Next, pick a image that you want to do inference on. It can be any image that you want but try to pick one that actually contains an object from the set of imagenet classes. I have also provided a image of my own cat in the my_cat.jpg
file.
Open another terminal, which we are going to use for inference. The easiest way to do inference is using curl
directly in the terminal but you are also free to experiment with the requests
API directly in python. Using curl
should look something like this
curl http://127.0.0.1:8080/predictions/my_fancy_model -T my_image.jpg\n
Torchserve supports serving multiple models, not just one. Create a new vision model (either another resnet model or something similar), script it, save it, archive it in the save model store folder and then re-run torchserve like this
torchserve --start --ncs --model-store model_store --models all\n
Make sure that you can do inference with both models by calling curl
.
That ends the module on local deployment. Hopefully in this phase you have gained a bit of experience with sending HTTP requests, as this will be very important in the next module when we try to deploy the models in the cloud.
"},{"location":"s8_monitoring/","title":"Monitoring","text":"Slides
We have now reached the end of our machine learning pipeline. We have successfully developed, trained and deployed a machine learning model. However, the question then becomes if you can trust that your newly deployed model still works as expected after 1 day without you intervening? What about 1 month? What about 1 year?
There may be corner cases where an ML models is working as expected, but the wast majority of ML models will perform worse over time because they are not generalizing well enough. For example, assume you have just deployed an application that classifies images from phones, when suddenly a new phone comes out with a new kind of sensor that takes images that either have very weird aspect ratio or something else your model is not robust towards. There is nothing wrong with this, you can essentially just retrain your model on new data that accounts for this corner case, however you need a mechanisms that informs you.
This is very monitoring comes into play. Monitoring practices are in charge of collecting any information about your application in some format that can then be analyzed and reacted on. Monitoring is essential to securing the longevity of your applications.
As with many other sub-fields within MLOps we can divide monitoring into classic monitoring and ML specific monitoring. Classic monitoring (known from classic DevOps) is often about
All these are basic information you are interested in regardless of what application type you are trying to deploy. However, then there are ML related monitoring that especially relates data. Take the example above, with the new phone, this we would in general consider to be data drifting problem e.g. the data you are trying to do inference on have drifted away from the distribution of data your model was trained on. Such monitoring problems are unique to machine learning applications and needs to be handled separately.
We are in this session going to see examples of both kinds of monitoring.
Learning objectives
The learning objectives of this session are:
evidently
frameworkData drifting is one of the core reasons for model accuracy degrades over time in production. For machine learning models, data drift is the change in model input data that leads to model performance degradation. In practical terms, this means that the model is receiving input that is outside of the scope that it was trained on, as seen in the figure below. This shows that the underlying distribution of a particular feature has slowly been increasing in value over two years
Image creditIn some cases, it may be that if you normalize some feature in a better way that you are able to generalize your model better, but this is not always the case. The reason for such a drift is commonly some external factor that you essentially have no control over. That really only leaves you with one option: retrain your model on the newly received input features and deploy that model to production. This process is probably going to repeat over the lifetime of your application if you want to keep it up-to-date with the real world.
Image creditWe have now come up with a solution to the data drift problem, but there is one important detail that we have not taken care of: When we should actually trigger the retraining? We do not want to wait around for our model performance to degrade, thus we need tools that can detect when we are seeing a drift in our data.
"},{"location":"s8_monitoring/data_drifting/#exercises","title":"\u2754 Exercises","text":"For these exercises we are going to use the framework Evidently developed by EvidentlyAI. Evidently currently supports both detection for both regression and classification models. The exercises are in large taken from here and in general we recommend if you are in doubt about an exercise to look at the docs for API and examples (their documentation can be a bit lacking sometimes, so you may also have to dive into the source code).
Additionally, we want to stress that data drift detection, concept drift detection etc. is still an active field of research and therefore exist multiple frameworks for doing this kind of detection. In addition to Evidently, we can also mention NannyML, WhyLogs and deepcheck.
Start by install Evidently
pip install evidently\n
you will also need scikit-learn
and pandas
installed if you do not already have it.
Hopefully you already gone through session S7 on deployment. As part of the deployment to GCP functions you should have developed a application that can classify the iris dataset, based on a model trained by this script . We are going to convert this into a FastAPI application for the purpose here:
Convert your GCP function into a FastAPI application. The appropriate curl
command should look something like this:
curl -X 'POST' \\\n 'http://127.0.0.1:8000/iris_v1/?sepal_length=1.0&sepal_width=1.0&petal_length=1.0&petal_width=1.0' \\\n -H 'accept: application/json' \\\n -d ''\n
and the response body should look like this:
{\n \"prediction\": \"Iris-Setosa\",\n \"prediction_int\": 0\n}\n
We have implemented a solution in this file (called v1) if you need help.
Next we are going to add some functionality to our application. We need to add that the input for the user is saved to a database whenever our application is called. However, to not slow down the response to our user we want to implement this as an background task. A background task is a function that should be executed after the user have got their response. Implement a background task that save the user input to a database implemented as a simple .csv
file. You can read more about background tasks here. The header of the database should look something like this:
time, sepal_length, sepal_width, petal_length, petal_width, prediction\n2022-12-28 17:24:34.045649, 1.0, 1.0, 1.0, 1.0, 1\n2022-12-28 17:24:44.026432, 2.0, 2.0, 2.0, 2.0, 1\n...\n
thus both input, timestamp and predicted value should be saved. We have implemented a solution in this file (called v2) if you need help.
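As a hint, here is a sketch of how such a background task could be wired up (the actual model call is left out and the column order follows the header above):
from datetime import datetime\n\nfrom fastapi import BackgroundTasks, FastAPI\n\napp = FastAPI()\n\ndef add_to_database(now: str, sepal_length: float, sepal_width: float, petal_length: float, petal_width: float, prediction: int) -> None:\n    # append a single row to the csv \"database\"\n    with open(\"prediction_database.csv\", \"a\") as file:\n        file.write(f\"{now}, {sepal_length}, {sepal_width}, {petal_length}, {petal_width}, {prediction}\\n\")\n\n@app.post(\"/iris_v1/\")\nasync def iris_inference(sepal_length: float, sepal_width: float, petal_length: float, petal_width: float, background_tasks: BackgroundTasks):\n    prediction = 0  # replace with a call to your trained classifier\n    background_tasks.add_task(add_to_database, str(datetime.now()), sepal_length, sepal_width, petal_length, petal_width, prediction)\n    return {\"prediction\": \"Iris-Setosa\", \"prediction_int\": prediction}\n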
Call you API a number of times to generate some dummy data in the database.
Create a new data_drift.py
file where we are going to implement the data drifting detection and reporting. Start by adding both the real iris data and your generated dummy data as pandas dataframes.
import pandas as pd\nfrom sklearn import datasets\nreference_data = datasets.load_iris(as_frame='auto').frame\ncurrent_data = pd.read_csv('prediction_database.csv')\n
if done correctly you will most likely end up with two dataframes that look like
# reference_data\nsepal length (cm) sepal width (cm) petal length (cm) petal width (cm) target\n0 5.1 3.5 1.4 0.2 0\n1 4.9 3.0 1.4 0.2 0\n...\n148 6.2 3.4 5.4 2.3 2\n149 5.9 3.0 5.1 1.8 2\n[150 rows x 5 columns]\n\n# current_data\ntime sepal_length sepal_width petal_length petal_width prediction\n2022-12-28 17:24:34.045649 1.0 1.0 1.0 1.0 1\n...\n2022-12-28 17:24:34.045649 1.0 1.0 1.0 1.0 1\n[10 rows x 5 columns]\n
Standardize the dataframes such that they have the same column names and drop the time column from the current_data
dataframe.
We are now ready to generate some reports about data drifting:
Try executing the following code:
from evidently.report import Report\nfrom evidently.metric_preset import DataDriftPreset\nreport = Report(metrics=[DataDriftPreset()])\nreport.run(reference_data=reference, current_data=current)\nreport.save_html('report.html')\n
and open the generated .html
page. What does it say about your data? Have it drifted? Make sure to poke around to understand what the different plots are actually showing.
Data drifting is not the only kind of reporting evidently can make. We can also get reports on the data quality. Try first adding a few Nan
values to your reference data. Secondly, try changing the report to
from evidently.metric_preset import DataDriftPreset, DataQualityPreset\nreport = Report(metrics=[DataDriftPreset(), DataQualityPreset()])\n
and re-run the report. Checkout the newly generated report. Again go over the generated plots and make sure that it picked up on the missing values you just added.
The final report present we will look at is the TargetDriftPreset
. Target drift means that our model is over/under predicting certain classes e.g. or general terms the distribution of predicted values differs from the ground true distribution of targets. Try adding the TargetDriftPreset
to the Report
class and re-run the analysis and inspect the result. Have your targets drifted?
Evidently reports are meant for debugging, exploration and reporting of results. However, as we stated in the beginning, what we are actually interested in methods automatically detecting when we are beginning to drift. For this we will need to look at Test and TestSuites:
Lets start with a simple test that checks if there are any missing values in our dataset:
from evidently.test_suite import TestSuite\nfrom evidently.tests import TestNumberOfMissingValues\ndata_test = TestSuite(tests=[TestNumberOfMissingValues()])\ndata_test.run(reference_data=reference, current_data=current)\n
again we could run data_test.save_html
to get a nice view of the results (feel free to try it out) but additionally we can also call data_test.as_dict()
method that will give a dict with the test results. What dictionary key contains the if all tests have passed or not?
Take a look at this colab notebook that contains all tests implemented in Evidently. Pick 5 tests of your choice, where at least 1 fails by default and implement them as a TestSuite
. Then try changing the arguments of the test so they better fit your usecase and get them all passing.
(Optional) When doing monitoring in practice, we are not always interested in running on all data collected from our API maybe only the last N
entries or maybe just from the last hour of observations. Since we are already logging the timestamps of when our API is called we can use that for filtering. Implement a simple filter that either takes an integer n
and returns the last n
entries in our database or some datetime t
that filters away observations earlier than this.
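A sketch of such a filter (the file and column names follow the database used above) could be:
from datetime import datetime, timedelta\nfrom typing import Optional\n\nimport pandas as pd\n\ndef load_latest(n: Optional[int] = None, t: Optional[datetime] = None) -> pd.DataFrame:\n    \"\"\"Return the last n rows and/or only rows newer than t.\"\"\"\n    df = pd.read_csv(\"prediction_database.csv\", parse_dates=[\"time\"])\n    if t is not None:\n        df = df[df[\"time\"] > t]\n    if n is not None:\n        df = df.tail(n)\n    return df\n\n# e.g. only keep observations from the last hour\ncurrent_data = load_latest(t=datetime.now() - timedelta(hours=1))\n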
Evidently by default only supports structured data, e.g. tabular data (so does nearly every other framework). Thus, the question then becomes how we can extend this to unstructured data such as images or text. The solution is to extract structured features from the data, which we can then run the analysis on.
(Optional) For images the simple solution would be to flatten the images and consider each pixel a feature, however this does not work in practice because changes in the individual pixels do not really tell anything about the image. Instead we should derive some features such as:
These are all numbers that can make up a feature vector for a image. Try out doing this yourself, for example by extracting such features from MNIST and FashionMNIST datasets, and check if you can detect a drift between the two sets.
(Optional) For text a common approach is to extra some higher level embedding such as the very classical GLOVE embedding. Try following this tutorial to understand how drift detection is done on text.
Lets instead take a deep learning based approach to doing this. Lets consider the CLIP model, which is normally used to do image captioning. For our purpose this is perfect because we can use the model to get abstract feature embeddings for both images and text:
from PIL import Image\nimport requests\n# requires transformers package: pip install transformers\nfrom transformers import CLIPProcessor, CLIPModel\n\nmodel = CLIPModel.from_pretrained(\"openai/clip-vit-base-patch32\")\nprocessor = CLIPProcessor.from_pretrained(\"openai/clip-vit-base-patch32\")\n\nurl = \"http://images.cocodataset.org/val2017/000000039769.jpg\"\nimage = Image.open(requests.get(url, stream=True).raw)\n\n# set either text=None or images=None when only the other is needed\ninputs = processor(text=[\"a photo of a cat\", \"a photo of a dog\"], images=image, return_tensors=\"pt\", padding=True)\n\nimg_features = model.get_image_features(inputs['pixel_values'])\ntext_features = model.get_text_features(inputs['input_ids'], inputs['attention_mask'])\n
Both img_features
and text_features
are in this case a (512,)
abstract feature embedding, that should be able to tell us something about our data distribution. Try using this method to extract features on two different datasets like CIFAR10 and SVHN if you want to work with vision or IMDB movie review and Amazon review for text. After extracting the features try running some of the data distribution testing you just learned about.
(Optional) If we have multiple applications and want to run monitoring for each application we often want also the monitoring to be a deployed application (that only we can access). Implement a /monitoring/
endpoint that does all the reporting we just went through such that you have two endpoints:
http://127.0.0.1:8000/iris_infer/?sepal_length=1.0&sepal_width=1.0&petal_length=1.0&petal_width=1.0 # user endpoint\nhttp://127.0.0.1:8000/iris_monitoring/ # monitoring endpoint\n
Our monitoring endpoint should return a HTML page either showing an Evidently report or test suit. Try implementing this endpoint. We have implemented a solution in this file if you need help with how to return an HTML page from a FastAPI application.
As an final exercise, we recommend that you try implementing this to run directly in the cloud. You will need to implement this in a container e.g. GCP Run service because the data gathering from the endpoint should still be implemented as an background task. For this to work you will need to change the following:
Instead of saving the input to a local file you should either store it in GCP bucket or an BigQuery SQL table (this is a better solution, but also out-of-scope for this course)
You can either run the data analysis locally by just pulling from cloud storage predictions and training data or alternatively you can deploy this as its own endpoint that can be invoked. For the latter option we recommend that this should require authentication.
That ends the module on detection of data drifting, data quality etc. If this has not already been made clear, monitoring of machine learning applications is an extremely hard discipline because it is not a clear cut when we should actually respond to feature beginning to drift and when it is probably fine. That comes down to the individual application what kind of rules that should be implemented. Additionally, the tools presented here are also in no way complete and are especially limited in one way: they are only considering the marginal distribution of data. Every analysis that we done have been on the distribution per feature (the marginal distribution), however as the image below show it is possible for data to have drifted to another distribution with the marginal being approximately the same.
There are methods such as Maximum Mean Discrepancy (MMD) tests that are able to do testing on multivariate distributions, which you are free to dive into. In this course we will just always recommend to consider multiple features when doing decision regarding your deployed applications.
"},{"location":"s8_monitoring/monitoring/","title":"M26 - System Monitoring","text":""},{"location":"s8_monitoring/monitoring/#monitoring","title":"Monitoring","text":"In this module we are going to look into more classical monitoring of applications. The key concept we are often working with here is called telemetry. Telemetry in general refer to any automatic measurement and wireless transmission of data from our application. It could be numbers such as:
In general there are three different kinds of telemetry we are interested in:
Name Description Example Purpose Metrics Metrics are quantitative measurements of the system. They are usually numbers that are aggregated over a period of time. E.g. the number of requests per minute. The number of requests per minute. Metrics are used to get an overview of the system. They are often used to create dashboards that can be used to get an overview of the system. Logs Logs are textual or structured records generated by applications. They provide a detailed account of events, errors, warnings, and informational messages that occur during the operation of the system. System logs, error logs Logs are essential for diagnosing issues, debugging, and auditing. They provide a detailed history of what happened in a system, making it easier to trace the root cause of problems and track the behavior of components over time. Traces Traces are detailed records of specific transactions or events as they move through a system. A trace typically includes information about the sequence of operations, timing, and dependencies between different components. Distributed tracing in microservices architecture Traces help in understanding the flow of a request or a transaction across different components. They are valuable for identifying bottlenecks, understanding latency, and troubleshooting issues related to the flow of data or control.We are mainly going to focus in this module on metrics.
"},{"location":"s8_monitoring/monitoring/#local-instrumentator","title":"Local instrumentator","text":"Before we look into the cloud lets at least conceptually understand how a given instance of a app can expose values that we may be interested in monitoring.
The standard framework for exposing metrics is called prometheus. Prometheus is a time series database that is designed to store metrics. It is also designed to be very easy to instrument applications with and it is designed to scale to large amounts of data. The way prometheus works is that it exposes a /metrics
endpoint that can be queried to get the current state of the metrics. The metrics are exposed in a format called prometheus text format.
Start by installing prometheus-fastapi-instrumentator
in python
pip install prometheus-fastapi-instrumentator\n
this will allow us to easily instrument our FastAPI application with prometheus.
Create a simple FastAPI application in a file called app.py
. You can reuse any application from the previous module on APIs. To that file now add the following code:
from prometheus_fastapi_instrumentator import Instrumentator\n\n# your app code here\n\nInstrumentator().instrument(app).expose(app)\n
This will instrument your application with prometheus and expose the metrics on the /metrics
endpoint.
Run the app using uvicorn
server. Make sure that the app exposes the endpoints you expect it too exposes, but make sure you also checkout the /metrics
endpoint.
The metric endpoint exposes multiple /metrics
. Metrics always looks like this:
# TYPE key <type>\nkey value\n
e.g. it is essentially a ditionary of key-value pairs with the added functionality of a <type>
. Look at this page over the different types prometheus metrics can have and try to understand the different metrics being exposed.
Look at the documentation for the prometheus-fastapi-instrumentator
and try to add at least one more metric to your application. Rerun the application and confirm that the new metric is being exposed.
Any cloud system with respect for itself will have some kind of monitoring system. GCP has a service called Monitoring that is designed to monitor all the different services. By default it will monitor a lot of metrics out-of-box. However, the question is if we want to monitor more than the default metrics. The complexity that comes with doing monitoring in the cloud is that we need more than one container. We at least need one container actually running the application that is also exposing the /metrics
endpoint and then we need a another container that is collecting the metrics from the first container and storing them in a database. To implement such system of containers that need to talk to each others we in general need to use a container orchestration system such as Kubernetes. This is out of scope for this course, but we can use a feature of Cloud Run
called sidecar containers
to achieve the same effect. A sidecar container is a container that is running alongside the main container and can be used to do things such as collecting metrics.
Overall we recommend that you just become familiar with the monitoring tab for your cloud run service (see image) above. Try to invoke your service a couple of times and see what happens to the metrics over time.
Try creating a service level objective (SLO). In short a SLO is a target for how well your application should be performing. Click the Create SLO
button and fill it out with what you consider to be a good SLO for your application.
(Optional) To expose our own metrics we need to setup a sidecar container. To do this follow the instructions here. We have setup a simple example that uses fastapi and prometheus that you can find here. After you have correctly setup the sidecar container you should be able to see the metrics in the monitoring tab.
A core problem within monitoring is alert systems. The alert system is in charge of sending out alerts to relevant people when some telemetry or metric we are tracking is not behaving as it should. Alert systems are a subjective choice of when and how many should be send out and in general should be proportional with how important to the of the metric/telemetry. We commonly run into what is referred to the goldielock problem where we want just the right amount of alerts however it is more often the case that we either have
Therefore, setting up proper alert systems can be as challenging as setting up the systems for actually the metrics we want to trigger alerts.
"},{"location":"s8_monitoring/monitoring/#exercises_2","title":"\u2754 Exercises","text":"We are in this exercise going to look at how we can setup automatic alerting such that we get an message every time one of our applications are not behaving as expected.
Go to the Monitoring
service. Then go to Alerting
tab.
Start by setting up an notification channel. A recommend setting up with an email.
Next lets create a policy. Clicking the Add Condition
should bring up a window as below. You are free to setup the condition as you want but the image is one way bo setup an alert that will react to the number of times an cloud function is invoked (actually it measures the amount of log entries from cloud functions).
After adding the condition, add the notification channel you created in one of the earlier steps. Remember to also add some documentation that should be send with the alert to better describe what the alert is actually doing.
When the alert is setup you need to trigger it. If you setup the condition as the image above you just need to invoke the cloud function many times. Here is a small code snippet that you can execute on your laptop to call a cloud function many time (you need to change the url and payload depending on your function):
import time\nimport requests\nurl = 'https://us-central1-dtumlops-335110.cloudfunctions.net/function-2'\npayload = {'message': 'Hello, General Kenobi'}\n\nfor _ in range(1000):\n r = requests.get(url, params=payload)\n
Make sure that you get the alert through the notification channel you setup.
Slides
This module is all about scaling the applications that we are building. We are here going to use a very narrow definition of scaling namely that we want our applications to run faster, however one should note that in general scaling is a much broader term. There are many different ways to scale your applications and we are going to look at three of these related to different tasks machine learning algorithms:
We are going to approach the term scaling from two different angles that both should result in your application running faster. The first approach is levering multiple devices, such as using multiple CPU cores or parallelizing training across multiple GPUs. The second approach is more analytical, were we are actually going to look at how we can design smaller/faster model architectures that runs faster.
It should be noted that this module is specific to working with Pytorch applications. In particular we are going to see how we can both improve base Pytorch code and how to utilize the Pytorch Lightning which we introduced in module M14 on boilerplate to improve the scaling of our applications. If your application is written using another framework we can guarantee that the same techniques in these modules transfers to that framework, but may require you do seek out how to specifically to it.
If you manage to complete all modules in this session, feel free to checkout the extra module on scalable hyperparameter optimization.
Learning objectives
The learning objectives of this session are:
pytorch-lightning
Core Module
One way that deep learning fundamentally changed the way we think about data in machine learning is that more data is always better. This was very much not the case with more traditional machine learning algorithms (random forest, support vector machines etc.) where a pleatau in performance was often reached for a certain amount of data and did not improve if more was added. However, as deep learning models have become deeper and deeper and thereby more and more data hungry performance seems to be ever increasing or at least not reaching a pleatau in the same way as for traditional machine learning.
Image creditAs we are trying to feed more and more data into our models and obvious first question to ask is how to do this in a efficient way. As an general rule of thumb we want the performance bottleneck to be the forward/backward e.g. the actual computation in our neural network and not the data loading. By bottleneck we here refer to the part of our pipeline that is restricting how fast we can process data. If data loading is our bottleneck, then our compute device can sit idle while waiting for data to arrive, which is both inefficient and costly. For example if you are using a cloud provider for training deep learning models, you are paying by the hour per device, and thus not using them fully can be costly in the long run.
In the first set of exercises we are therefore going to focus on distributed data loading i.e. how do load data in parallel to make sure that we always have data ready for our compute devices. We are in the following going to look at what is going on behind the scene when we use Pytorch to parallelize data loading.
"},{"location":"s9_scalable_applications/data_loading/#a-closer-look-on-data-loading","title":"A closer look on Data loading","text":"Before we talk distributed applications it is important to understand the physical layout of a standard CPU (the brain of your computer).
Most modern CPUs is a single chip that consist of multiple cores. Each core can further be divided into threads. In most laptops the core count is 4 and commonly 2 threads per code. This means that the common laptop have 8 threads. The number of threads a compute unit has is important, because that directly corresponds to the number of parallel operations that can be executed i.e. one per thread. In a Python terminal you should be able to get the number of cores in your machine by writing (try it):
import multiprocessing\ncores = multiprocessing.cpu_count()\nprint(f\"Number of cores: {cores}, Number of threads: {2*cores}\")\n
A distributed application is in general any kind of application that parallelizes some or all of it workload. We are in these exercises only focusing on distributed data loading, which happens primarily only on the CPU. In Pytorch
it is easy to parallelize data loading if you are using their dataset/dataloader interface:
from torch.utils.data import Dataset, DataLoader\nclass MyDataset(Dataset):\n def __init__(self, ...):\n # whatever logic is needed to init the data set\n self.data = ...\n\n def __getitem__(self, idx):\n # return one item\n return self.data[idx]\n\ndataset = MyDataset()\ndataloader = Dataloader(\n dataset,\n batch_size=8,\n num_workers=4 # this is the number of threads we want to parallelize workload over\n)\n
Lets take a deep dive into what happens when we request a batch from our dataloader e.g. next(dataloader)
. First we must understand that we have a thread that plays the role of the main and the remaining threads (in the above example we request 4) are called workers. When the dataloader is created, we create this structure and make sure that all threads have a copy of our dataset definition so each can call the __getitem__
method.
Then comes the actual part where we request a batch for data. Assume that we have a batch size of 8 and we do not do any shuffeling. In this step the master thread then distributes the list of requested data points ([0,1,2,3,4,5,6,7]
) to the four worker threads. With 8 indices and 4 workers, each worker will receive 2 indices.
Each worker thread then calls __getitem__
method for all the indices it has received. When all processes are done, the loaded images datapoints gets send back to the master thread collected into a single structure/tensor.
Each arrow is corresponds to a communication between two threads, which is not a free operations. In total to get a single batch (not counting the initial startup cost) in this example we need to do 8 communication operations. This may seem like a small price to pay, but that may not be the case. If the process time of __getitem__
is very low (data is stored in memory, we just need to index to get it) then it does not make sense to use multiprocessing. The computationally saving by doing the look-up operations in parallel is smaller than the communication cost there is between the main thread and the workers. Multiprocessing makes sense when the process time of __getitem__
is high (data is probably stored on the harddrive).
It is this trade-off that we are going to investigate in the exercises.
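To make the trade-off concrete, here is a minimal sketch (reusing the hypothetical dataset object from the snippet above) of how one could time the same dataloader with a different number of workers:
import time\nfrom torch.utils.data import DataLoader\n\nfor num_workers in [0, 4]:\n    dataloader = DataLoader(dataset, batch_size=8, num_workers=num_workers)\n    start = time.time()\n    for batch_idx, batch in enumerate(dataloader):\n        if batch_idx == 100:  # only time the first 100 batches\n            break\n    end = time.time()\n    print(f\"num_workers={num_workers}: {end - start:.2f} seconds\")\n
With a cheap __getitem__ the single-process run will often win; with an expensive one the extra workers should start to pay off.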
"},{"location":"s9_scalable_applications/data_loading/#exercises","title":"\u2754 Exercises","text":"This exercise is intended to be done on the labeled faces in the wild (LFW) dataset. The dataset consist images of famous people extracted from the internet. The dataset had been used to drive the field of facial verification, which you can read more about here. We are going imagine that this dataset cannot fit in memory, and your job is therefore to construct a data pipeline that can be parallelized based on loading the raw datafiles (.jpg) at runtime.
Download the dataset and extract it to a folder. It does not matter if you choose the non-aligned or aligned version of the dataset.
We provide the lfw_dataset.py
file where we have started the process of defining a data class. Fill out the __init__
, __len__
and __getitem__
. Note that __getitem__
expects you to return a single img
which should be a torch.Tensor
. Loading should be done using PIL Image, as PIL
images are the default input format for torchvision transforms (used for data augmentation).
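As a starting point, a minimal sketch of what the filled-out class could look like is shown below. It assumes the extracted images sit in nested folders of .jpg files and that a torchvision transform is passed in; the argument names are illustrative, the provided lfw_dataset.py defines the actual interface.
import glob\nimport os\n\nimport torch\nfrom PIL import Image\nfrom torch.utils.data import Dataset\n\nclass LFWDataset(Dataset):\n    def __init__(self, path_to_folder: str, transform) -> None:\n        # only collect the file paths here; the images themselves are loaded lazily in __getitem__\n        self.image_paths = glob.glob(os.path.join(path_to_folder, \"**\", \"*.jpg\"), recursive=True)\n        self.transform = transform\n\n    def __len__(self) -> int:\n        return len(self.image_paths)\n\n    def __getitem__(self, index: int) -> torch.Tensor:\n        # load the raw .jpg from disk at runtime and convert it to a tensor via the transform\n        img = Image.open(self.image_paths[index]).convert(\"RGB\")\n        return self.transform(img)\n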
Make sure that the script runs without any additional arguments
python lfw_dataset.py\n
Visualize a single batch by filling out the codeblock after the first TODO right after defining the dataloader. The visualization should be shown when launching the script as
python lfw_dataset.py -visualize_batch\n
Hint: this tutorial.
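One possible way to fill out the visualization block (assuming the dataloader returns plain image tensors) is sketched below, following the approach in the linked tutorial:
import matplotlib.pyplot as plt\nfrom torchvision.utils import make_grid\n\nbatch = next(iter(dataloader))  # tensor of shape [batch_size, 3, H, W]\ngrid = make_grid(batch)\n# make_grid returns [3, H, W]; matplotlib expects the channel dimension last\nplt.imshow(grid.permute(1, 2, 0))\nplt.axis(\"off\")\nplt.savefig(\"lfw_batch.png\")\n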
Experiment with how the number of workers influences the performance. We have already provided code that will pass over 100 batches from the dataset 5 times and calculate how long it takes, which you can play around with by calling
python lfw_dataset.py -get_timing -num_workers 1\n
Make an errorbar plot with the number of workers along the x-axis and the timing along the y-axis. The errorbars should correspond to the standard deviation over the 5 runs. HINT: if it is taking too long to evaluate, measure the time over fewer batches (set the -batches_to_check
flag). Also if you are not seeing an improvement, try increasing the batch size (since data loading is parallelized per batch).
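A sketch of the plotting code could look like the following; the timing numbers are placeholders that you should replace with your own measurements from the previous step:
import matplotlib.pyplot as plt\n\nnum_workers = [0, 1, 2, 4, 8]\n# placeholder values: mean and standard deviation over the 5 runs, in seconds\nmeans = [10.2, 8.1, 6.3, 5.0, 4.8]\nstds = [0.4, 0.3, 0.2, 0.2, 0.3]\n\nplt.errorbar(num_workers, means, yerr=stds, fmt=\"-o\", capsize=3)\nplt.xlabel(\"Number of workers\")\nplt.ylabel(\"Time to load 100 batches (s)\")\nplt.savefig(\"timing_vs_workers.png\")\n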
For certain machines like the Mac with M1 chipset it is necessary to set the multiprocessing_context
flag in the dataloader to \"fork\"
. This essentially tells the dataloader how the worker processes should be created.
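In practice this is just an extra argument to the dataloader, e.g. (a sketch, assuming the dataset from before):
from torch.utils.data import DataLoader\n\ndataloader = DataLoader(\n    dataset,\n    batch_size=8,\n    num_workers=4,\n    multiprocessing_context=\"fork\",  # how the worker processes are created (the default on MacOS is spawn)\n)\n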
Retry the experiment where you change the data augmentation to be more complex:
from torchvision import transforms\n\nlfw_trans = transforms.Compose([\n    transforms.RandomAffine(5, (0.1, 0.1), (0.5, 2.0)),\n    # add more transforms here\n    transforms.ToTensor()\n])\n
By making the augmentation more computationally demanding, it should be easier to get a boost in performance when using multiple workers, because the data augmentation is also executed in parallel.
(Optional, requires access to GPU) If your dataset fits in GPU memory it is beneficial to set the pin_memory
flag to True
. By setting this flag we are essentially telling Pytorch that it can lock the data in place in (page-locked) memory, which will make the transfer between the host (CPU) and the device (GPU) faster.
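A minimal sketch of how this could look, assuming a CUDA device is available:
from torch.utils.data import DataLoader\n\ndataloader = DataLoader(dataset, batch_size=8, num_workers=4, pin_memory=True)\n\nfor batch in dataloader:\n    # with pinned (page-locked) host memory the copy to the GPU can be done asynchronously\n    batch = batch.to(\"cuda\", non_blocking=True)\n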
This ends the module on distributed data loading in Pytorch. If you want to go into more detail, we highly recommend that you read this paper, which goes into great detail analyzing how data loading in Pytorch works and provides performance benchmarks.
"},{"location":"s9_scalable_applications/distributed_training/","title":"M28 - Distributed Training","text":""},{"location":"s9_scalable_applications/distributed_training/#distributed-training","title":"Distributed Training","text":"In this module we are going to look at distributed training. Distributed training is one of the key ingredients to all the awesome results that deep learning models are producing. For example: Alphafold the highly praised model from Deepmind that seems to have solved protein structure prediction, was trained in a distributed fashion for a few weeks. The training was done on 16 TPUv3s (specialized hardware), which is approximately equal to 100-200 modern GPUs. This means that training Alphafold without distributed training on a single GPU (probably not even possible) would take a couple of years to train! Therefore, it is simply impossible currently to train some of the state-of-the-art (SOTA) models within deep learning currently, without taking advantage of distributed training.
When we talk about distributed training, there are a number of different paradigms that we may use to parallelize our computations
In this module we are going to look at data parallel training, which is the original way of doing parallel training, and distributed data parallel training, which is an improved version of data parallel. If you want to know more about sharded training, the newest of the paradigms, you can read about it in this blog post, which describes how sharded training can save over 60% of the memory used during training.
Finally, we want to note that for all the exercises in this module you are going to need a multi-GPU setup. If you have not already gained access to multi-GPU machines on GCP (see the quotas exercises in this module) you will need to find another way of running the exercises. For DTU students I can recommend checking out this optional module on using the high performance cluster (HPC), where you can get access to multi-GPU resources.
"},{"location":"s9_scalable_applications/distributed_training/#data-parallel","title":"Data parallel","text":"While data parallel today in general is seen as obsolete compared to distributed data parallel, we are still going to investigate it a bit since it offers the most simple form of distributed computations in deep learning pipeline.
The figure below shows both the forward and backward step in the data parallel paradigm.
The steps are the following:
Whenever we try to do a forward call, e.g. out=model(batch)
, we take the batch and divide it equally between all devices. If we have a batch size of N
and M
devices each device will be sent N/M
datapoints.
Afterwards each device receives a copy of the model
, i.e. a copy of the weights that currently parametrize our neural network.
In this step we perform the actual forward pass in parallel. This is the step that actually helps us scale our training.
Finally we need to send back the output of each replicated model to the primary device.
Similar to the analysis we did for parallel data loading, we cannot always expect that this will actually take less time than doing the forward call on a single GPU. If we are parallelizing over M
devices, we essentially need to do 3xM
communication calls to send the batch, model and output between the devices. If the savings from the parallel forward call do not outweigh this, then it will take longer.
In addition, we also have the backward pass to focus on.
As the forward pass ended by collecting the outputs on the primary device, this is also where the loss is accumulated. Thus, the loss gradients are first calculated on the primary device
Next we scatter the gradient to all the workers
The workers then perform a parallel backward pass through their individual model
Finally, we reduce (sum) the gradients from all the workers on the main process such that we can do gradient descent.
One of the big downsides of using data parallel is that all the replicas are destroyed after each backward call. This means that we need to replicate our model over and over again and send it to the devices that take part in the computation.
Even though it seems like a lot of logic is needed to implement data parallel in your code, in Pytorch we can enable data parallel training very simply by wrapping our model in the nn.DataParallel class.
from torch import nn\nmodel = MyModelClass()\nmodel = nn.DataParallel(model, device_ids=[0, 1]) # data parallel on gpu 0 and 1\npreds = model(input) # same as usual\n
"},{"location":"s9_scalable_applications/distributed_training/#exercises","title":"\u2754 Exercises","text":"Please note that the exercise only makes sense if you have access to multiple GPUs.
Create a new script (call it data_parallel.py
) where you take a copy of the model FashionCNN
from the fashion_mnist.py
script. Instantiate the model and wrap torch.nn.DataParallel
around it such that it can be executed in data parallel.
Try to run inference in parallel on multiple devices (pass a batch multiple times and time it) e.g.
import time\nimport torch\n\nn_reps = 10\nstart = time.time()\nfor _ in range(n_reps):\n    out = model(batch)\ntorch.cuda.synchronize()  # wait for all GPU work to finish before stopping the timer\nend = time.time()\nprint(f\"Average time per forward pass: {(end - start) / n_reps:.4f} s\")\n
Does data parallel decrease the inference time? If no, can you explain why that may be? Try playing around with the batch size, and see if data parallel is more beneficial for larger batch sizes.
It should be clear that there is a huge disadvantage to using the data parallel paradigm to scale your applications: the model needs to be replicated on each pass (because the replicas are destroyed at the end), which requires a large transfer of data. This is the main problem that distributed data parallel tries to solve.
The key difference between distributed data parallel and data parallel is that we move the model update (the gradient step) to happen on each device in parallel instead of only on the main device. This has the consequence that we do not need to replicate the model on each step; instead we just keep a local version on each device that we keep updating. The full set of steps (as shown in the figure):
Initialize an exact copy of the model on each device
From disk (or memory) we start by loading data into a section of page-locked host memory per device. Page-locked memory is essentially a way to reserve a piece of a computer's memory for a specific transfer that is going to happen over and over again, in order to speed it up. The page-locked regions are loaded with non-overlapping data.
Transfer data from page-locked memory to each device in parallel
Perform forward pass in parallel
Do an all-reduce operation on the gradients. An all-reduce operation is a so-called all-to-all operation, meaning that all processes send their own gradients to all other processes and also receive the gradients from all other processes.
Reduce the combined gradient signal from all processes and update the individual model in parallel. Since all processes received the same gradient information, all models will still be in sync.
Thus, in distributed data parallel we end up doing only a single communication call between all processes, compared to all the communication going on in data parallel. While all-reduce is a more expensive operation than many of the other communication operations we could do, because we only have to do a single one we gain a huge performance boost. Empirically, distributed data parallel tends to be 2-3 times faster than data parallel.
However, this performance increase does not come for free. Where we could implement data parallel in a single line in Pytorch, distributed data parallel is much more involved.
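To give a feel for what more involved means in practice, the sketch below shows the typical building blocks; the provided distributed_example.py is the authoritative version for the exercises, and MyModelClass/dataset are stand-ins for your own model and data:
import torch\nimport torch.distributed as dist\nfrom torch.nn.parallel import DistributedDataParallel as DDP\nfrom torch.utils.data import DataLoader, DistributedSampler\n\ndef run(rank: int, world_size: int):\n    # every process joins the same process group; assumes MASTER_ADDR and MASTER_PORT are set\n    # in the environment, which is what the accompanying .sh file takes care of\n    dist.init_process_group(\"nccl\", rank=rank, world_size=world_size)\n    torch.cuda.set_device(rank)\n\n    model = MyModelClass().to(rank)\n    model = DDP(model, device_ids=[rank])  # gradients are all-reduced automatically during backward\n\n    # the sampler makes sure each process sees a disjoint part of the dataset\n    sampler = DistributedSampler(dataset, num_replicas=world_size, rank=rank)\n    dataloader = DataLoader(dataset, batch_size=8, sampler=sampler)\n\n    for batch in dataloader:\n        ...  # forward, loss, backward and optimizer step as usual\n\n    dist.destroy_process_group()\n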
"},{"location":"s9_scalable_applications/distributed_training/#exercises_1","title":"\u2754 Exercises","text":"We have provided an example of how to do distributed data parallel training in Pytorch in the two files distributed_example.py
and distributed_example.sh
. Your objective is to get an understanding of the components necessary to get this kind of distributed training to work. Try to answer the following questions (HINT: try to Google around):
What is the function of the DDP
wrapper?
What is the function of the DistributedSampler
?
Why is it necessary to call dist.barrier()
before passing a batch into the model?
What do the different environment variables do in the .sh
file?
Try to benchmark the runs using 1 and 2 GPUs
The first exercise has hopefully convinced you that it can be quite troublesome to write distributed training applications yourself. Luckily for us, Pytorch-lightning
can take care of this for us such that we do not have to care about the specific details. To get your model training on multiple GPUs you need to change two arguments in the trainer: the accelerator
flag and the gpus
flag. In addition to this, you can read through this guide about any additional steps you may need to do (for many of you, it should just work). Try running your model on multiple GPUs.
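As a rough sketch (the exact argument names depend on your Pytorch-lightning version, so check the linked guide):
from pytorch_lightning import Trainer\n\n# older Lightning versions: Trainer(accelerator=\"gpu\", gpus=2)\n# newer Lightning versions: Trainer(accelerator=\"gpu\", devices=2, strategy=\"ddp\")\ntrainer = Trainer(accelerator=\"gpu\", devices=2, strategy=\"ddp\")\ntrainer.fit(model)  # model is your LightningModule\n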
Try benchmarking your training using 1 and 2 GPUs, e.g. by running a couple of epochs and measuring how long it takes. How much of a speedup can you actually get? Why can you not get a speedup of 2?
Inference is the task of applying our trained model to new and unseen data, often called prediction. Scaling inference is therefore different from scaling data loading and training, mainly because inference normally only uses a single data point (or a few). As we can neither parallelize the data loading nor parallelize across multiple GPUs (at least not in any efficient way), those tricks are of no use when we are doing inference. Secondly, inference is often not done on machines that can perform large computations, as most inference today happens either on edge devices, e.g. mobile phones, or in low-cost, low-compute cloud environments. Thus, we need to be smarter about how we scale inference than just throwing more compute at it.
In this module we are going to look at various ways that you can either reduce the size of your model and/or make your model faster. Both are important for running inference fast, regardless of the setup you are running your model on. We want to note that this is still very much an active area of research, and therefore best practices for what to do in a specific situation can change.
"},{"location":"s9_scalable_applications/inference/#choosing-the-right-architecture","title":"Choosing the right architecture","text":"Assume you are starting a completely new project and have to come up with a model architecture for doing this. What is you strategy? The common way to do this, is to look at prior work on similar problems that you are facing and either directly choosing the same architecture or creating some slight variation hereof. This is a great way to get started, but the architecture that you end up choosing may be optimal in terms of performance but not inference speed.
The fact is that not all base architectures are created equal, and a 10K parameter model with one architecture can have a significantly different inference speed than another 10K parameter model with another architecture. For example, consider the figure below, which compares a number of models from the timm package, colored based on their base architecture. The general trend is that the number of images a model can process per second (y-axis) is inversely proportional to the number of parameters (x-axis). However, we generally see that convolutional base architectures (conv) are more efficient than transformers (vit) for the same parameter budget.
Image credit"},{"location":"s9_scalable_applications/inference/#exercises","title":"\u2754 Exercises","text":"As dissed in this blogpost the largest increase in inference speed you will see (given some specific hardware) is choosing an efficient model architectures. In the exercises below we are going to investigate the inference speed of different architectures.
Start by checking out this table which contains a list of pretrained weights in torchvision
. Try finding an
model that has in the range of 20-30 million parameters.
Write a small script that initializes all the models and does inference with them. It should look something like this
import time\n\nimport torch\nfrom torchvision import models\n\n# placeholders: swap in the torchvision models you picked in the previous step\nm1 = models.ModelArchitecture1()\nm2 = models.ModelArchitecture2()\nm3 = models.ModelArchitecture3()\n\nn_reps = 10\ninput = torch.randn(100, 3, 256, 256)\n\nfor i, m in enumerate([m1, m2, m3]):\n    tic = time.time()\n    for _ in range(n_reps):\n        _ = m(input)\n    toc = time.time()\n    print(f\"Model {i} took: {(toc - tic) / n_reps}\")\n
Do the results make sense? Based on the above figure we would expect efficientnet to be faster than resnet, which should be faster than the transformer based model. Is this also what you are seeing?
To figure out why one network is more efficient than another, we can try to count the operations each network needs to do for inference. An operation can here be defined as a FLOP (floating point operation), which is any mathematical operation (such as +, -, *, /) or assignment that involves floating-point numbers. Luckily for us, someone has already created a python package for calculating this in pytorch: ptflops
Install the package
pip install ptflops\n
Try calling the get_model_complexity_info
function from the ptflops
package on the networks from the previous exercise. What are the results?
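A small usage sketch, reusing the three models from the previous exercise:
from ptflops import get_model_complexity_info\n\nfor i, m in enumerate([m1, m2, m3]):\n    # (3, 256, 256) is the input resolution without the batch dimension\n    macs, params = get_model_complexity_info(m, (3, 256, 256), as_strings=True, print_per_layer_stat=False)\n    print(f\"Model {i}: {macs}, {params}\")\n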
In the table from the initial exercise, you could also see the overall performance of each network on the Imagenet-1K dataset. Given this performance, the inference speed and the FLOPs count, which network would you choose to use in a production setting? Discuss when choosing one over another should be considered.
Quantization is a technique where all computations are performed with integers instead of floats. We are essentially taking all continuous signals and converting them into discretized signals.
Image credit As discussed in this blogpost series, while float
(32-bit) is the primarily used precision in machine learning because it strikes a good balance between memory consumption, precision and computational requirements, it does not mean that we cannot take advantage of quantization during inference to improve the speed of our model. For instance:
Floating-point computations are slower than integer operations
Recent hardware often has specialized support for doing integer operations
Many neural networks are actually not bottlenecked by how many computations they need to do, but by how fast data can be transferred, i.e. the memory bandwidth and cache of your system are the limiting factors. Therefore, working with 8-bit integers instead of 32-bit floats means that we can move data around approximately 4 times as fast.
Storing models in integers instead of floats saves us approximately 75% of the RAM/hard-disk space whenever we save a checkpoint. This is especially useful in relation to deploying models using docker (as you hopefully remember), as it will lower the size of our docker images.
But how do we convert between floats and integers in quantization? In most cases we often use a linear affine quantization:
$$ x_{int} = \\text{round}\\left( \\frac{x_{float}}{s} + z \\right) $$
where $s$ is a scale and $z$ is the so-called zero point. But how does this relate to doing inference in a neural network? The figure below shows all the conversions that we need to make to our standard inference pipeline to actually do the computations in quantized format.
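As a small worked example of the affine mapping (the scale and zero point below are chosen purely for illustration):
import torch\n\nx = torch.tensor([-1.0, 0.0, 0.5, 2.0])\nxq = torch.quantize_per_tensor(x, scale=0.1, zero_point=10, dtype=torch.quint8)\nprint(xq.int_repr())    # stored integers: round(x / 0.1) + 10 -> [0, 10, 15, 30]\nprint(xq.dequantize())  # back to (approximate) floats: (q - 10) * 0.1\n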
Image credit"},{"location":"s9_scalable_applications/inference/#exercises_1","title":"\u2754 Exercises","text":"Lets look at how quantized tensors look in Pytorch
Start by creating a tensor that contains some random numbers
Next call the torch.quantize_per_tensor
function on the tensor. What does the quantized tensor look like? How do the values relate to the scale
and zero_point
arguments?
Finally, try to call the .dequantize()
method on the tensor. Do you get a tensor back that is close to what you initially started out with?
As you hopefully saw in the first exercise, we introduce a number of rounding errors when doing quantization, and naively we would expect these to accumulate and lead to a much worse model. However, in practice we observe that quantization still works, and we actually have a mathematically sound reason for this. Can you figure out why quantization still works with all the small rounding errors? HINT: it has to do with the central limit theorem
Let's move on to quantization of our model. Follow this tutorial from Pytorch on how to do quantization. The goal is to construct a model model_fc32
that works on normal floats and a quantized version model_int8
. For simplicity you can just use one of the models from the tutorial.
Let's try to benchmark our quantized model and see if all the trouble that we went through actually paid off. Also try to perform the benchmark on the non-quantized model and see if you get a difference. If you do not get an improvement, explain why that may be.
Pruning is another way to reduce the model size and maybe improve the performance of our network. As the figure below illustrates, in pruning we simply remove weights in our network that we do not consider important for the task at hand. By removing, we here mean that the weight gets set to 0. There are many ways to determine if a weight is important, but the general rule is that the importance of a weight is proportional to its magnitude. This makes intuitive sense, since weights in all linear operations (fully connected or convolutional) are multiplied onto the incoming value, so a small weight means a small outgoing activation.
Image credit"},{"location":"s9_scalable_applications/inference/#exercises_2","title":"\u2754 Exercises","text":"We provide a start script that implements the famous LeNet in this file. Open and run it just to make sure that you know the network.
Pytorch already has some pruning methods implemented in its package. Import the prune
module from torch.nn.utils
in the script.
Try to prune the weights of the first convolutional layer by calling
from torch.nn.utils import prune\n\nprune.random_unstructured(module_1, name=\"weight\", amount=0.3)  # (1)!\n
Try printing the named_parameters
, named_buffers
before and after the module is pruned. Can you explain the difference, and what is the connection to the module_1.weight
attribute?
Try pruning the bias of the same module this time using the l1_unstructured
function from the pruning module. Again check the named_parameters
, named_buffers
attributes to make sure you understand the difference between L1 pruning and random unstructured pruning.
Instead of pruning only a single module in the model, let's try pruning the whole model. To do this we just need to iterate over all named_modules
in the model like this:
for name, module in new_model.named_modules():\n    # only modules with a weight parameter (e.g. Conv2d and Linear) can be pruned this way\n    if isinstance(module, (torch.nn.Conv2d, torch.nn.Linear)):\n        prune.l1_unstructured(module, name='weight', amount=0.2)\n
But what if we wanted to apply different pruning to different layers? Implement a pruning scheme where
amount=0.2
amount=0.4
Call print(dict(new_model.named_buffers()).keys())
after the pruning to confirm that all weights have been correctly pruned.
The pruning we have looked at until now has only been local in nature, i.e. we have applied the pruning independently for each layer, not accounting globally for how much we should actually prune. As you may realize, this can quickly lead to a network that is pruned too much. Instead, the more common approach is to prune globally, where we remove the smallest X
amount of connections:
Start by creating a tuple over all the weights with the following format
parameters_to_prune = (\n (model.conv1, 'weight'),\n # fill in the rest of the modules yourself\n (model.fc3, 'weight'),\n)\n
The tuple needs to have length 5. Challenge: Can you construct the tuple using for
loops, such that the code works for arbitrary size networks?
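One possible solution sketch for the challenge, collecting the weight of every convolutional and linear layer:
from torch import nn\n\nparameters_to_prune = tuple(\n    (module, \"weight\")\n    for module in model.modules()\n    if isinstance(module, (nn.Conv2d, nn.Linear))\n)\n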
Next prune using the global_unstructured
function to globally prune the tuple of parameters
prune.global_unstructured(\n parameters_to_prune,\n pruning_method=prune.L1Unstructured,\n amount=0.2,\n)\n
Check that the amount that has been pruned is actually equal to the 20% specified in the pruning. We provide the following function, which for a given submodule (for example model.conv1
) computes the amount of pruned weights
def check_prune_level(module: nn.Module):\n sparsity_level = 100 * float(torch.sum(module.weight == 0) / module.weight.numel())\n print(f\"Sparsity level of module {sparsity_level}\")\n
With a pruned network we really want to see if all our effort actually ended up with a network that is faster and/or smaller in memory. Do the following to the globally pruned network from the previous exercises:
First we need to make the pruning of our network permanent. Right now it is only semi-permanent as we are still keeping a copy of the original weights in memory. Make the change permanent by calling prune.remove
on every pruned module in the model. Hint: iterate over the parameters_to_prune
tuple.
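A minimal sketch of how this could be done:
for module, name in parameters_to_prune:\n    # folds the pruning mask into the weight and removes the weight_orig / weight_mask buffers\n    prune.remove(module, name)\n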
Next try to measure the time of a single inference (repeated 100 times) for both the pruned and non-pruned network
import time\n\nimport torch\n\ntic = time.time()\nfor _ in range(100):\n    _ = network(torch.randn(100, 1, 28, 28))\ntoc = time.time()\nprint(toc - tic)\n
Is the pruned network actually faster? If not can you explain why?
Next lets measure the size of our network (called pruned_network
) and a freshly initialized network (called network
):
torch.save(pruned_network.state_dict(), 'pruned_network.pt')\ntorch.save(network.state_dict(), 'network.pt')\n
Look up the size of each file. Is the pruned network actually smaller? If not, can you explain why?
Repeat the last exercise, but this time start by converting all pruned weights to sparse format first by calling the .to_sparse()
method on each pruned weight. Is the saved model smaller now?
This ends the exercises on pruning. As you probably realized in the last couple of exercises, pruning does not guarantee speedups out of the box. This is because linear operations in Pytorch do not take advantage of sparse structures by default. To actually get speedups we would need to dive deep into sparse tensor operations, which again do not guarantee a speedup, because the performance of these operations depends on the sparsity structure of the pruned weights. Investigating this is out of scope for these exercises, but we highly recommend checking it out if you are interested in sparse networks.
"},{"location":"s9_scalable_applications/inference/#knowledge-distillation","title":"Knowledge distillation","text":"Knowledge distillation is somewhat similar to pruning, in the sense that it tries to find a smaller model that can perform equally well as a large model, however it does so in a completely different way. Knowledge distillation is a model compression technique that builds on the work of Bucila et al. in which we try do distill/compress the knowledge of a large complex model (also called the teacher model) into a simpler model (also called the student model).
The best known example of this is the DistilBERT model. The DistilBERT model is a smaller version of the large natural-language processing model BERT, which achieves 97% of the performance of BERT while having 40% fewer weights and being 60% faster. You can see in the figure below how much smaller it is compared to other models developed at the same time.
Image credit Knowledge distillation works by assuming we have a big teacher model that is already performing well and that we want to compress. By running our training set through the large model we get a softmax distribution for each and every training sample. The goal of the student is to both match the original labels of the training data and match the softmax distribution of the teacher model. The intuition behind doing this is that the teacher model needs to be more complex to learn the complex inter-class relationships from just (one-hot) labels. The student, on the other hand, is fed directly with softmax distributions from the teacher that explicitly encode these inter-class relationships, and thus does not need the same capacity to learn the same as the teacher.
Image credit"},{"location":"s9_scalable_applications/inference/#exercises_3","title":"\u2754 Exercises","text":"Lets try implementing model distillation ourself. We are going to see if we can achieve this on the cifar10 dataset. Do note that exercise below can take quite long time to finish because it involves training multiple networks and therefore involve some waiting.
Start by installing the transformers
and datasets
packages from Huggingface
pip install transformers\npip install datasets\n
which we are going to use to download the cifar10 dataset and a teacher model.
Next download the cifar10 dataset
from datasets import load_dataset\ndataset = load_dataset(\"cifar10\")\n
Next lets initialize our teacher model. For this we consider a large transformer based model:
from transformers import AutoFeatureExtractor, AutoModelForImageClassification\nextractor = AutoFeatureExtractor.from_pretrained(\"aaraki/vit-base-patch16-224-in21k-finetuned-cifar10\")\nmodel = AutoModelForImageClassification.from_pretrained(\"aaraki/vit-base-patch16-224-in21k-finetuned-cifar10\")\n
To get the logits (un-normalized softmax scores) from our teacher model for a single datapoint from the training dataset you would extract it like this:
sample_img = dataset['train'][0]['img']\npreprocessed_img = extractor(sample_img, return_tensors='pt')\noutput = model(**preprocessed_img)\nprint(output.logits)\n# tensor([[ 3.3682, -0.3160, -0.2798, -0.5006, -0.5529, -0.5625, -0.6144, -0.4671, 0.2807, -0.3066]])\n
Repeat this process for the whole training dataset and store the results somewhere.
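A sketch of how this loop could look, storing everything in a single tensor on disk (for the full training set this will take a while):
import torch\n\nteacher_logits = []\nfor sample in dataset[\"train\"]:\n    preprocessed = extractor(sample[\"img\"], return_tensors=\"pt\")\n    with torch.no_grad():\n        out = model(**preprocessed)\n    teacher_logits.append(out.logits.squeeze(0))\ntorch.save(torch.stack(teacher_logits), \"teacher_logits.pt\")\n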
Implement a simple convolutional model. You can create a custom one yourself or use a small one from torchvision
.
Train the model on cifar10 to convergence, so you have a base result on how the model is performing.
Redo the training, but this time add knowledge distillation to your training objective. It should look like this:
for batch in dataloader:\n    # ...\n    img, target, teacher_logits = batch\n    preds = model(img)\n    loss = torch.nn.functional.cross_entropy(preds, target)\n    # use the softmax of the teacher logits as soft targets for the student\n    loss_teacher = torch.nn.functional.cross_entropy(preds, teacher_logits.softmax(dim=-1))\n    loss = loss + loss_teacher\n    loss.backward()\n    # ...\n
Compare the final performance obtained with and without knowledge distillation. Did the performance improve or not?
This ends the module on scaling inference in machine learning models.
"},{"location":"tools/","title":"Tools","text":"Just a collection of tools and scripts for running the course.
"}]} \ No newline at end of file diff --git a/sitemap.xml.gz b/sitemap.xml.gz index 6ca664eb3..0519d0396 100644 Binary files a/sitemap.xml.gz and b/sitemap.xml.gz differ