Commit

L24 example time
jzarnett committed Sep 13, 2023
1 parent daa88ff commit 1cf699c
Showing 3 changed files with 58 additions and 2 deletions.
12 changes: 11 additions & 1 deletion lectures/459.bib
@@ -1301,4 +1301,14 @@ @misc{hf
year = 2023,
url = {https://huggingface.co/docs/transformers/perf_train_gpu_one},
note = {Online; accessed 2023-09-11}
}

@misc{hf2,
author = {Hugging Face},
title = {{Model Training Anatomy (v. 4.33.0)}},
month = {September},
year = 2023,
url = {https://huggingface.co/docs/transformers/model_memory_anatomy},
note = {Online; accessed 2023-09-13}
}

8 changes: 7 additions & 1 deletion lectures/L24.tex
@@ -19,12 +19,18 @@ \section*{Large Language Models and You}
\subsection*{Optimizing LLMs}
The content of this section is based on a guide from ``Hugging Face'', which describes itself as an AI community that wants to democratize the technology. The guide in question is about methods and tools for training using one GPU~\cite{hf} (though we can also discuss multi-GPU setups). Indeed, you may have guessed from the placement of this topic in the course material that the GPU is the right tool for generating or training a large language model.

Okay, but why a GPU? In this case we're talking about Transformers, and there are three main groups of operations they perform~\cite{hf2}: Tensor Contractions, Statistical Normalizations, and Element-Wise Operators. Tensor contractions involve matrix-matrix multiplications and are the most computationally expensive part of a transformer; statistical normalizations, like softmax and layer normalization, are a mapping followed by a reduction; and element-wise operators, such as dropout and bias additions, are not very computationally intensive. We don't need to repeat the reasoning as to why GPUs are good at matrix-matrix multiplication and reduction operations, since that's already been discussed.
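
To make the three groups concrete, here is a minimal PyTorch sketch with one representative operation from each group; the shapes are arbitrary and not taken from the guide.

import torch
import torch.nn.functional as F

x = torch.randn(8, 512, 1024)    # (batch, sequence, hidden)
w = torch.randn(1024, 1024)
b = torch.randn(1024)

h = x @ w                        # tensor contraction: a large matrix-matrix multiply
h = F.layer_norm(h, (1024,))     # statistical normalization: map plus reduction
h = F.dropout(h + b, p=0.1)      # element-wise: bias addition and dropout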

In discussing the optimizations we can make, we'll also need to consider what is in GPU memory, since training a model might be limited by available GPU memory rather than by compute time. Things like the model parameters, gradients, optimizer state, and temporary buffers all count towards this limit.
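
As a rough back-of-the-envelope sketch, assuming plain fp32 AdamW training (weights, gradients, and two optimizer moments are each stored per parameter; activations and temporary buffers are extra), the static memory cost can be estimated from the parameter count alone:

def estimate_training_memory_gb(num_params,
                                bytes_weights=4,     # fp32 weights
                                bytes_grads=4,       # fp32 gradients
                                bytes_optimizer=8):  # AdamW keeps two fp32 moments
    total_bytes = num_params * (bytes_weights + bytes_grads + bytes_optimizer)
    return total_bytes / 1024**3

# bert-large-uncased has roughly 340 million parameters
print(f"{estimate_training_memory_gb(340e6):.1f} GB before activations and buffers")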

\paragraph{Optimizing.} There are two kinds of optimization worth talking about. The first is model performance: how do we generate a model that gives answers or predictions quickly? The second is how we can generate or train the model efficiently.

The first is easy to motivate, and we have learned numerous techniques that could be applied here: use more space to reduce CPU usage, optimize for the common case, speculate, et cetera. Some of these are more fun than others: given a particular question, can you guess what the follow-up might be?
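
As one hypothetical illustration of the space-for-time and speculation ideas (the function names here are made up), imagine caching generated answers and pre-generating the answer to a predicted follow-up question:

answer_cache = {}

def answer(question, generate):
    # Space-for-time: reuse a previously generated answer if we have one.
    if question not in answer_cache:
        answer_cache[question] = generate(question)
    return answer_cache[question]

def speculate_followup(question, predict_followup, generate):
    # Speculation: guess the likely follow-up and warm the cache before it is asked.
    followup = predict_followup(question)
    if followup not in answer_cache:
        answer_cache[followup] = generate(followup)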

Before we get into the subject of how, we should address the question of why you would wish to generate or customize an LLM rather than use an existing one. To start with, you might not want to send your (sensitive) data to a third party for analysis, although that alone doesn't settle it, since you can download and run some existing models yourself. Generating a model or refining an existing one makes sense in a situation where you will get better results from a more specialized model than from the generic one. To illustrate what I mean, ChatGPT will gladly make you a Dungeons \& Dragons campaign setting, but you don't need that capability if you want a model to analyze your customer behaviours and find the customers most likely to be open to upgrading their plan. That extra capability (parameters) takes up space and computational time, and a smaller model that gives better answers is more efficient.

\subsection*{Techniques}
Now we can discuss the techniques for optimizing LLM training and how they map to things that we've already discussed in the course.

\input{bibliography.tex}

40 changes: 40 additions & 0 deletions lectures/live-coding/L24/dummy_data.py
@@ -0,0 +1,40 @@
import numpy as np
import torch
from datasets import Dataset
from pynvml import nvmlInit, nvmlDeviceGetHandleByIndex, nvmlDeviceGetMemoryInfo
from transformers import AutoModelForSequenceClassification
from transformers import TrainingArguments, Trainer, logging


def print_gpu_utilization():
    # Query GPU 0 through NVML and report how much of its memory is in use.
    nvmlInit()
    handle = nvmlDeviceGetHandleByIndex(0)
    info = nvmlDeviceGetMemoryInfo(handle)
    print(f"GPU memory occupied: {info.used//1024**2} MB.")


def print_summary(result):
    # Report training time, throughput, and the resulting GPU memory usage.
    print(f"Time: {result.metrics['train_runtime']:.2f}")
    print(f"Samples/second: {result.metrics['train_samples_per_second']:.2f}")
    print_gpu_utilization()


logging.set_verbosity_error()

# Baseline memory usage, then the cost of initializing the CUDA context.
print_gpu_utilization()
torch.ones((1, 1)).to("cuda")
print_gpu_utilization()

# Loading the model weights onto the GPU accounts for the next chunk of memory.
model = AutoModelForSequenceClassification.from_pretrained("bert-large-uncased").to("cuda")
print_gpu_utilization()

# Dummy dataset: 512 random sequences of 512 token IDs each, with dummy labels.
seq_len, dataset_size = 512, 512
dummy_data = {
    "input_ids": np.random.randint(100, 30000, (dataset_size, seq_len)),
    "labels": np.random.randint(0, 1, (dataset_size)),
}
ds = Dataset.from_dict(dummy_data)
ds.set_format("pt")

# Minimal training settings (assumed defaults; adjust as needed).
default_args = {
    "output_dir": "tmp",
    "evaluation_strategy": "no",
    "num_train_epochs": 1,
    "log_level": "error",
    "report_to": "none",
}

training_args = TrainingArguments(per_device_train_batch_size=4, **default_args)
trainer = Trainer(model=model, args=training_args, train_dataset=ds)
result = trainer.train()
print_summary(result)
