Syllabus: ML_Spring_Syllabus
Lecture 1 (Jan 30):
- Types of Machine Learning: We differentiated between supervised, unsupervised, and reinforcement learning as the main approaches in the field.
- Classification and Regression: Within supervised learning these are the two tasks, where our target is categorical or continuous respectively.
- Supervised Learning with the IRIS Dataset: We demonstrated the k-nearest neighbors algorithm using the IRIS dataset to introduce classification tasks.
- Importance of Data Splitting: We emphasized the need for splitting data into training and testing sets to prevent misleading accuracy scores.
- Model Evaluation Techniques: We introduced train-test split, validation sets, and cross-validation methods for assessing model performance.
- Cross-Validation: We explained how cross-validation helps in estimating model performance more reliably by using different data subsets for training and testing.
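To make the cross-validation idea concrete, here is a minimal sketch of how k-fold splitting assigns samples to training and testing subsets. The function name `k_fold_indices` is hypothetical, and the sketch assumes the data has already been shuffled; it is an illustration, not the code used in class.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_idx, test_idx) pairs for k roughly equal, contiguous folds."""
    indices = list(range(n_samples))
    # The first (n_samples % k) folds get one extra sample.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test_idx = indices[start:start + size]
        train_idx = indices[:start] + indices[start + size:]
        yield train_idx, test_idx
        start += size

folds = list(k_fold_indices(n_samples=10, k=3))
for train_idx, test_idx in folds:
    print("train:", train_idx, "test:", test_idx)
```

Each sample lands in exactly one test fold, so every data point is used for testing once and for training k-1 times.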
Lecture 2 (Feb 1): We introduced the concept of Linear Regression, and used it as an example to illustrate other topics including:
- Loss or Objective Functions: The class explored the significance of loss functions, such as the mean squared error, in defining how well a model is performing.
- Gradient Descent: We discussed the gradient descent optimization algorithm, emphasizing its role in minimizing the loss function by iteratively adjusting model parameters.
- Analytical vs. Gradient Descent Solutions: The differences between analytical solutions, like the normal equation, and iterative solutions, like gradient descent, were highlighted, including when and why to use one over the other.
- Feature Engineering and Polynomial Regression: The session covered feature engineering, specifically the creation of polynomial features, and how they enhance linear regression models to fit non-linear data.
- Regularization Techniques: We introduced LASSO and Ridge regression as regularization techniques to prevent overfitting by penalizing large coefficients.
- LASSO (L1) Regularization usually results in some coefficients being set to zero (this is called sparsity)
- Ridge (L2) Regularization discourages coefficients from becoming too big but doesn't necessarily result in sparsity
- Bias-Variance Tradeoff: We discussed the bias-variance tradeoff between model complexity and generalization ability.
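As a rough illustration of the analytical versus gradient descent comparison, the sketch below fits a one-feature linear regression both ways on a tiny made-up dataset. The data, the learning rate `lr`, and the step count are assumptions chosen for the example, not values from the lecture.

```python
# Toy data lying exactly on y = 2x + 1 (made up for illustration).
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]
n = len(xs)

def mse(w, b):
    """Mean squared error of the line y = w*x + b on the toy data."""
    return sum((w * x + b - y) ** 2 for x, y in zip(xs, ys)) / n

# Analytical solution (the one-feature case of the normal equation).
x_mean, y_mean = sum(xs) / n, sum(ys) / n
w_exact = (sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
           / sum((x - x_mean) ** 2 for x in xs))
b_exact = y_mean - w_exact * x_mean

# Iterative solution: plain gradient descent on the MSE.
w, b, lr = 0.0, 0.0, 0.05
for _ in range(5000):
    grad_w = (2 / n) * sum((w * x + b - y) * x for x, y in zip(xs, ys))
    grad_b = (2 / n) * sum((w * x + b - y) for x, y in zip(xs, ys))
    w -= lr * grad_w
    b -= lr * grad_b

print("analytical:", w_exact, b_exact)
print("gradient descent:", w, b, "mse:", mse(w, b))
```

The closed form gives the answer in one step, while gradient descent approaches the same parameters gradually; the iterative route becomes attractive when the closed form is too expensive or does not exist.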
Lecture 3 (Feb 6): We continued to look at Linear Regression, this time focusing on:
- Loss Functions: How they quantify model errors and guide the learning process. We looked at some interactive examples.
- Gradient Descent: How we minimize loss functions whose minima are too hard to calculate explicitly.
- Learning Rate: The importance of choosing the right learning rate for convergence.
- Convexity: Its significance in ensuring we find the global optimum.
- Manual Derivation: Calculating derivatives to understand gradients and looking at a basic Python implementation.
- Momentum: Introduced to accelerate convergence and navigate complex loss landscapes more effectively.
- Regularization Techniques: Briefly discussed LASSO and Ridge regression to reduce overfitting and, in the case of LASSO, make the model sparse.
Lecture 4 (Feb 8):
- Linear Regression (wrap-up):
- Feature Scaling: Discussed the importance of scaling features to prevent issues with model convergence and numerical stability.
- Model Interpretation: Examined the coefficients of a linear regression model to understand the impact of each feature.
- Feature Engineering: Revisited one-hot encoding of categorical variables, demonstrating its application in our model analysis.
- Decision Trees: Introduced decision trees, clarifying the role of supervised learning and how it applies to model training and deployment.
- We illustrated classification with decision trees on the board, and discussed how a decision tree might go about learning a decision boundary.
- We compared the tree structure to the implications for decision boundaries.
Lecture 5 (Feb 13):
- Classification Problems and Decision Boundaries
- There are infinitely many ways to draw a general boundary
- If we restrict ourselves to rectilinear (boxy) boundaries there are still too many
- Decision Trees greedily choose the best binary split at each point and recursively partition the space
- Decision Trees
- How they're trained
- What do we mean by "best split"?
- For Classification this could be:
- Entropy
- Gini Impurity
- Misclassification Rate
- Using them for Regression
- Typically use MSE
- Early Stopping
- Max depth
- Min samples at leaf (or at decision node)
- Max tree size (total number of leaves)
- Pruning
- Cost complexity pruning
- Hyperparameters
- Pros
- Interpretability - Feature Importance
- Fast to train and query
- Insensitive to feature scaling
- Cons
- Tends to overfit
- Non-robust
- Bad at extrapolating out of training distribution
- Ensemble Methods for Decision Trees
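The impurity measures listed under "best split" for this lecture can be sketched in a few lines. The helper names (`gini`, `entropy`, `best_threshold`) and the toy data are hypothetical; the search below scores a split by the size-weighted impurity of the two children, which is one common choice rather than necessarily the exact criterion used in class.

```python
import math
from collections import Counter

def gini(labels):
    """Gini impurity: 1 - sum of squared class proportions."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def entropy(labels):
    """Shannon entropy of the class proportions, in bits."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def best_threshold(values, labels, impurity=gini):
    """Greedy search over midpoints of adjacent distinct values of one feature.

    Returns (threshold, weighted impurity of the two children).
    """
    best = (None, float("inf"))
    distinct = sorted(set(values))
    for lo, hi in zip(distinct, distinct[1:]):
        t = (lo + hi) / 2
        left = [y for x, y in zip(values, labels) if x <= t]
        right = [y for x, y in zip(values, labels) if x > t]
        score = (len(left) * impurity(left) + len(right) * impurity(right)) / len(labels)
        if score < best[1]:
            best = (t, score)
    return best

feature = [1, 2, 3, 10, 11, 12]
target = ["a", "a", "a", "b", "b", "b"]
print(gini(target), entropy(target))
print(best_threshold(feature, target))
```

A full tree would repeat this search over every feature at every node and recurse on the two resulting subsets.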
Lecture 6 (Feb 15):
- Reiterated concepts from last class including:
- How trees are structured
- Finding the best decision to reduce loss
- Random Forests as Bootstrap Aggregating + Randomized Feature Selection
- Gini Impurity: meaning and calculation
- Looked at concrete example with Titanic dataset:
- Explored issue leading to alleged 100% accuracy (including the target in the features)
- Explored why the first question used the "gender" feature
- Ran through the process of predicting unseen data
Lecture 7 (Feb 20):
- Node Importance and Feature Importance
- Introduced 8x8 digits classification:
- Looked at the dependence of the 'test_loss' upon number of trees
- Looked at the dependence of the 'test_loss' upon max depth of the trees
- Looked at the gap between the 'test_loss' and 'train_loss' which (when large) is a sign of overfitting
- Gradient Boosting
- AdaBoost
- Adds weights to both the samples and to the estimators:
- The weight of a sample is increased if the ensemble prediction error for that sample is bad
- The weight of each estimator is determined by its performance; anti-predictors can be given negative weight
- XGBoost: The model with all the bells and whistles, the best overall but with lots of hyperparameters to tune. Can underperform simpler models on smaller datasets.
- CatBoost: Specialized in handling categorical variables.
- LightGBM: Lightweight model which works well on big datasets or when efficiency is needed (for example: in mobile applications)
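The AdaBoost weighting scheme described above can be sketched as follows. This assumes the classic discrete AdaBoost formulas with labels in {-1, +1}, where each round reweights samples using the newest estimator's mistakes; other variants differ in detail, and the function names are hypothetical.

```python
import math

def estimator_weight(weighted_error):
    """Weight of a weak learner; negative when it is worse than chance (error > 0.5)."""
    return 0.5 * math.log((1 - weighted_error) / weighted_error)

def update_sample_weights(weights, y_true, y_pred, alpha):
    """Increase the weights of misclassified samples, decrease the rest, renormalize."""
    new = [w * math.exp(-alpha * yt * yp) for w, yt, yp in zip(weights, y_true, y_pred)]
    total = sum(new)
    return [w / total for w in new]

y_true = [+1, +1, -1, -1]
y_pred = [+1, -1, -1, -1]            # the second sample is misclassified
weights = [0.25, 0.25, 0.25, 0.25]   # start with uniform sample weights

err = sum(w for w, yt, yp in zip(weights, y_true, y_pred) if yt != yp)
alpha = estimator_weight(err)
new_weights = update_sample_weights(weights, y_true, y_pred, alpha)
print("estimator weight:", alpha)
print("new sample weights:", new_weights)
```

After the update the single misclassified sample carries half of the total weight, so the next weak learner is pushed to get it right.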
Lecture 8 (Feb 22):
- Boosting Methods Summary
- Quiz 1
- Review of quiz questions
- Support Vector Machines
- Decision boundary is a straight line, plane, or hyperplane depending on if you're in 2, 3, or 4+ dimensions respectively
- Specifically we pick the boundary with the "maximum margin" (the widest street)
- Digression into Linear Algebra: Dot Products
Lecture 9 (Feb 27):
- Getting non-linear boundaries with feature maps
- "Loss function"
- Dual form of loss function
- The "Kernel Trick"
- Feature maps revisited (reducing computational complexity)
- Polynomial Kernel
- Gaussian Kernel (aka. Radial Basis Functions (RBF))
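A small numerical check can make the kernel trick less mysterious. The sketch below assumes a degree-2 polynomial kernel `(x·y + 1)^2` in two dimensions and an explicit six-dimensional feature map chosen to match it; both are standard textbook choices, not necessarily the ones shown in class. The `gamma` parameter of the Gaussian kernel is likewise an assumed name.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def poly_feature_map(x):
    """Explicit feature map whose dot product equals (x.y + 1)^2 for 2-D inputs."""
    x1, x2 = x
    r2 = math.sqrt(2)
    return [1.0, r2 * x1, r2 * x2, x1 * x1, x2 * x2, r2 * x1 * x2]

def poly_kernel(x, y):
    """Same quantity computed directly in the original 2-D space."""
    return (dot(x, y) + 1.0) ** 2

def rbf_kernel(x, y, gamma=0.5):
    """Gaussian (RBF) kernel; its implicit feature space is infinite-dimensional."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * sq_dist)

x, y = [1.0, 2.0], [3.0, -1.0]
explicit = dot(poly_feature_map(x), poly_feature_map(y))
implicit = poly_kernel(x, y)
print(explicit, implicit, rbf_kernel(x, y))
```

Both routes give the same number, but the kernel never builds the six-dimensional vectors, which is where the computational saving comes from.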
Lecture 10 (Feb 29):
- Interactive SVM fitting example
- Hard vs Soft margin SVMs
- Slack parameter
- "Hinge Loss" formulation
- Gaussian Kernel
- Face Classification Example
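The "Hinge Loss" formulation of the soft-margin SVM can be written down directly. This is a minimal sketch assuming the common objective `0.5*||w||^2 + C * sum(max(0, 1 - y_i (w·x_i + b)))` with labels in {-1, +1}; the function name, the toy points, and the parameter name `C` are illustrative assumptions.

```python
def soft_margin_objective(w, b, X, y, C=1.0):
    """Return (objective value, per-sample slack) for a linear soft-margin SVM."""
    margins = [yi * (sum(wj * xj for wj, xj in zip(w, xi)) + b) for xi, yi in zip(X, y)]
    # Hinge loss: zero for points outside the margin, linear inside it.
    slack = [max(0.0, 1.0 - m) for m in margins]
    objective = 0.5 * sum(wj * wj for wj in w) + C * sum(slack)
    return objective, slack

X = [[2.0, 0.0], [0.5, 0.0], [-2.0, 0.0]]
y = [+1, +1, -1]
objective, slack = soft_margin_objective(w=[1.0, 0.0], b=0.0, X=X, y=y)
print("slack per sample:", slack)
print("objective:", objective)
```

Only the middle point sits inside the margin, so it is the only one that contributes slack. A larger `C` penalizes such violations more heavily and pushes the model toward hard-margin behavior.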
Lecture 11 (Mar 5):
- Example: Binary Classification of College Acceptance based upon SAT Score
- Looking at p(Acceptance | SAT Score) we see that the curve is s-shaped.
- We want to model it in a principled way
- We look at the odds (p(Accepted)/p(Rejected)) and want to view it on a log scale.
- Looking at the log-odds we see that it is linear!
- This gives us the idea to model the log-odds with a linear function. This corresponds to modeling p(Acceptance | SAT Score) with a "sigmoid" function
- Logistic Regression
- Model p(y|x) as $p(y|x) = \frac{e^{wx+b}}{1+e^{wx+b}}$
- Train by maximizing the "Likelihood" of the data according to the model. In practice we do this by minimizing the negative log of the Likelihood (NLL). This is the "famous" Binary Cross Entropy (BCE) loss.
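The link between linear log-odds, the sigmoid, and the BCE loss can be checked numerically. In this sketch the weight `w`, bias `b`, and SAT scores are made-up values for illustration, not fitted to any data.

```python
import math

def sigmoid(z):
    """Equivalent to e^z / (1 + e^z)."""
    return 1.0 / (1.0 + math.exp(-z))

def log_odds(p):
    return math.log(p / (1.0 - p))

def bce(y_true, p_pred):
    """Binary cross entropy: the average negative log-likelihood of the labels."""
    return -sum(y * math.log(p) + (1 - y) * math.log(1 - p)
                for y, p in zip(y_true, p_pred)) / len(y_true)

w, b = 0.01, -12.0                  # hypothetical parameters
sat_scores = [1000, 1200, 1400]
probs = [sigmoid(w * s + b) for s in sat_scores]

for s, p in zip(sat_scores, probs):
    print(s, round(p, 3), round(log_odds(p), 3))   # the log-odds recover w*s + b
print("BCE if the accepted labels are [0, 1, 1]:", bce([0, 1, 1], probs))
```

Applying `log_odds` to the model's probabilities returns the linear function `w*s + b`, which is exactly the modeling assumption described above.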
Lecture 12 (Mar 7):
- Model Evaluation and Selection
- Confusion Matrix
- Precision, Recall, F1 score
- Specificity and Sensitivity
- Example with unbalanced data
- ROC curves, and AUC (Self study: Didn't fit into class)
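To see why accuracy alone misleads on unbalanced data, the sketch below computes the metrics from this lecture for two made-up classifiers on a dataset with 5 positives and 95 negatives. The data and the helper names are hypothetical, and undefined ratios (0/0) are reported as 0.0 by convention here.

```python
def confusion_counts(y_true, y_pred):
    """Return (tp, fp, tn, fn) for binary labels where 1 is the positive class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp, fp, tn, fn

def report(y_true, y_pred):
    tp, fp, tn, fn = confusion_counts(y_true, y_pred)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0          # also called sensitivity
    specificity = tn / (tn + fp) if tn + fp else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": (tp + tn) / len(y_true), "precision": precision,
            "recall": recall, "specificity": specificity, "f1": f1}

y_true = [1] * 5 + [0] * 95
always_negative = [0] * 100                       # never predicts the rare class
model = [1, 1, 1, 0, 0] + [1, 1] + [0] * 93       # finds 3 of 5 positives, 2 false alarms

baseline = report(y_true, always_negative)
better = report(y_true, model)
print(baseline)
print(better)
```

The do-nothing baseline scores 95% accuracy while catching no positives at all, which is why precision, recall, and F1 are the more informative numbers here.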
Lecture 13 (Mar 12):
- Followup on Last Class: ROC curves:
- Selecting thresholds with ROC curves
- Evaluating binary classifiers with Area Under Curve (AUC)
- Diagonal line (AUC=0.5) is a random classifier which is totally uninformative
- Perfect classifier (AUC=1.0)
- Interactive example
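One way to build intuition for AUC is its ranking interpretation: it equals the probability that a randomly chosen positive is scored above a randomly chosen negative. The sketch below computes AUC that way by brute force over all pairs, counting ties as half; the scores are made up, and this pairwise approach is an illustration rather than the threshold-sweeping construction of the ROC curve.

```python
def auc(y_true, scores):
    """AUC as the fraction of (positive, negative) pairs that are ranked correctly."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

labels = [0, 0, 1, 1]
print(auc(labels, [0.1, 0.4, 0.35, 0.8]))   # one of the four pairs is mis-ordered
print(auc(labels, [0.1, 0.2, 0.8, 0.9]))    # perfect ranking
print(auc(labels, [0.5, 0.5, 0.5, 0.5]))    # uninformative constant score
```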
- Techniques for handling unbalanced data:
- Using appropriate model evaluation metrics
- Precision, Recall, F1, and AUC
- Choosing a threshold appropriately
- Undersampling
- Random Undersampling
- Prototyping Methods (Clustering)
- Oversampling
- Random Oversampling
- SMOTE
- Class Weighted Learning
- Ensemble Methods
- Balanced Random Forest
- AdaBoost
- Data Augmentation
- Anomaly Detection
- Neural Networks:
- Biological Inspiration and the Perceptron
- Quiz 2
- Neural Networks:
- Biological Inspiration and the Perceptron
- Interpreting Perceptrons as Boolean functions
- Interpreting Perceptrons as linear models
- Examining the perceptron decision boundary (it's just a straight line)
- Multi-Layer Perceptrons (MLPs)
- Example: Constructing an MLP with a triangular decision boundary.
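The reading of perceptrons as Boolean functions can be made concrete with hand-picked weights. The weights and biases below are assumptions chosen for illustration, and the example combines perceptrons into XOR rather than the triangular boundary built in class; XOR is the standard case a single perceptron cannot represent because its boundary is a straight line.

```python
def step(z):
    return 1 if z > 0 else 0

def perceptron(weights, bias):
    """Return a function computing step(w . x + b) for the given weights and bias."""
    return lambda *x: step(sum(w * xi for w, xi in zip(weights, x)) + bias)

# Single perceptrons implementing linearly separable Boolean functions.
AND = perceptron([1, 1], -1.5)
OR = perceptron([1, 1], -0.5)
NAND = perceptron([-1, -1], 1.5)

def XOR(a, b):
    """A two-layer network: hidden units OR and NAND, output unit AND."""
    return AND(OR(a, b), NAND(a, b))

for a in (0, 1):
    for b in (0, 1):
        print(a, b, "AND:", AND(a, b), "OR:", OR(a, b), "XOR:", XOR(a, b))
```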
π SPRING BREAK π
Lecture 15 (Mar 26):
- Neural Networks and MLPs
- MLPs are a specific type of Neural Net: (Namely, fully connected, feed forward NNs as opposed to something like a CNN)
- Architecture
- Input, Hidden, and Output Layers
- Number of layers and layer sizes are hyperparameters
- Activation Function
- Each neuron computes a weighted sum of its inputs plus a bias, $z = \sum_i w_i x_i + b$, and then passes it through an activation function
- Common choices include: sigmoid, tanh, linear, ReLU, LeakyReLU...
- The activation function of the output layer needs to be chosen with particular care to suit the problem (e.g., not choosing sigmoid for a regression problem where the network outputs need to be greater than 1)
- If we use a linear activation function for our hidden layers it is equivalent to not having any hidden layers! We need non-linearity!
- Linear activation still has a place though: it is often used for the output layer in regression problems (if the range of the output needs to be all real numbers)
- Gradient Descent requires that we compute derivatives of the loss with respect to each parameter
- Back-propagation is how we efficiently find these derivatives
- This shows us the critical importance of the derivative of the activation function:
- It must not be 0 all the time or the network will never learn: This means we can't use a "step" activation function
- It should be easy to compute: This motivates ReLU
- https://stanford.edu/~shervine/teaching/cs-229/cheatsheet-deep-learning
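The claim that linear hidden layers are equivalent to no hidden layers can be verified directly: two stacked linear layers collapse into a single one with combined weights. The weights below are arbitrary made-up numbers, and the helper names are hypothetical.

```python
def matvec(M, x):
    return [sum(m * xi for m, xi in zip(row, x)) for row in M]

def matmul(A, B):
    return [[sum(A[i][k] * B[k][j] for k in range(len(B))) for j in range(len(B[0]))]
            for i in range(len(A))]

def add(u, v):
    return [a + b for a, b in zip(u, v)]

def relu(v):
    return [max(0.0, a) for a in v]

# A 2 -> 3 -> 1 network with arbitrary weights.
W1 = [[1.0, 2.0], [0.0, -1.0], [3.0, 1.0]]
b1 = [0.5, -0.5, 1.0]
W2 = [[1.0, -1.0, 2.0]]
b2 = [0.25]

def two_layer(x, activation=lambda h: h):
    hidden = activation(add(matvec(W1, x), b1))
    return add(matvec(W2, hidden), b2)

# With a linear hidden layer the whole network is one linear map.
W_combined = matmul(W2, W1)
b_combined = add(matvec(W2, b1), b2)

def one_layer(x):
    return add(matvec(W_combined, x), b_combined)

x = [1.0, -2.0]
print("linear hidden layer:", two_layer(x), "single layer:", one_layer(x))
print("ReLU hidden layer:  ", two_layer(x, relu))
```

Swapping in ReLU changes the output, so the network can no longer be rewritten as a single linear layer; that extra expressiveness is what the non-linearity buys.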
Lecture 16 (Mar 28):
- Gradient Descent
- Perform updates to the parameters:
- Computing the gradient of the loss function involves taking a sum over the whole dataset. Full-batch gradient descent does this.
- This is computationally expensive but very stable, and it converges in fewer update steps
- In pure Stochastic Gradient Descent (SGD) you evaluate the loss, and update the parameters, using only one sample at a time.
- This is computationally very cheap but is less stable and may not converge quickly (or at all)
- Stability is not always desirable, as it can mean we get stuck in local minima more easily
- Dealing with lots of data with Batches
- Mini-batches
- MLPs
- What defines them?
- Architecture
- Activation Functions
- Parameter values (weights and biases)
- This is the secret sauce! We discussed the leak of Meta's LLaMA model which Meta likely spent on the order of $1,000,000 training.
- Formulating MLPs with Matrix Multiplication
- Counting parameters: one weight for each connection between neurons and one bias for each neuron.
- Review
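The parameter-counting rule for MLPs (one weight per connection, one bias per non-input neuron) is easy to turn into a short function. The name `count_parameters` and the example layer sizes are assumptions for illustration.

```python
def count_parameters(layer_sizes):
    """Count weights and biases of a fully connected network.

    layer_sizes lists the number of neurons per layer, input layer first.
    """
    # One weight for every connection between consecutive layers.
    weights = sum(n_in * n_out for n_in, n_out in zip(layer_sizes, layer_sizes[1:]))
    # One bias for every neuron that is not in the input layer.
    biases = sum(layer_sizes[1:])
    return weights + biases

# Example: 64 inputs (an 8x8 image), one hidden layer of 32, 10 output classes.
print(count_parameters([64, 32, 10]))
```

For the example network this gives 64*32 + 32*10 = 2368 weights plus 32 + 10 = 42 biases.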
Lecture 17 (Apr 2):
Midterm Exam
Lecture 18 (Apr 4):
- Exam Post-Mortem
- Suggested reading ("Guessing the Teacher's Password", or what it means to actually learn something): [https://www.lesswrong.com/posts/NMoLJuDJEms7Ku9XS/guessing-the-teacher-s-password](https://www.lesswrong.com/posts/NMoLJuDJEms7Ku9XS/guessing-the-teacher-s-password)
Lecture 19 (Apr 9):
- Multi-Class Classification
- Softmax output
- Use Cross Entropy
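A softmax output paired with cross entropy can be sketched in a few lines. The logits below are made up, and subtracting the maximum logit before exponentiating is a common numerical-stability trick rather than something specific to this course.

```python
import math

def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    m = max(logits)                      # subtract the max to avoid overflow
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(probs, true_class):
    """Negative log of the probability assigned to the correct class."""
    return -math.log(probs[true_class])

logits = [2.0, 1.0, 0.1]
probs = softmax(logits)
print("probabilities:", probs)
print("loss if the true class is 0:", cross_entropy(probs, 0))
print("loss if the true class is 2:", cross_entropy(probs, 2))
```

The loss is small when the network puts high probability on the correct class and grows without bound as that probability approaches zero.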
Lecture 20 (Apr 11):
- Hands on creation and training of Neural Networks
Lecture 21 (Apr 16):
- Convolutional Neural Networks
- Conv2D Layer
- Max Pooling Layer
- Excellent "cheat sheet" resource: https://stanford.edu/~shervine/teaching/cs-230/cheatsheet-convolutional-neural-networks
- Architecture of "AlexNet"
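What a Conv2D layer and a Max Pooling layer compute can be shown on a tiny image with plain Python lists. This sketch assumes a single channel, stride 1, and no padding for the convolution, and non-overlapping windows for the pooling; the image and the edge-detecting kernel are made up, and real layers learn their kernels and add a bias and activation.

```python
def conv2d(image, kernel):
    """Slide the kernel over the image (cross-correlation, stride 1, no padding)."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [[sum(image[i + di][j + dj] * kernel[di][dj]
                 for di in range(kh) for dj in range(kw))
             for j in range(out_w)]
            for i in range(out_h)]

def max_pool2d(feature_map, size=2):
    """Keep the largest value in each non-overlapping size x size window."""
    return [[max(feature_map[i + di][j + dj]
                 for di in range(size) for dj in range(size))
             for j in range(0, len(feature_map[0]) - size + 1, size)]
            for i in range(0, len(feature_map) - size + 1, size)]

# 5x5 image: dark on the left, bright on the right.
image = [[0, 0, 1, 1, 1] for _ in range(5)]
# 2x2 kernel that responds to a dark-to-bright vertical edge.
kernel = [[-1, 1],
          [-1, 1]]

feature_map = conv2d(image, kernel)
pooled = max_pool2d(feature_map)
print(feature_map)
print(pooled)
```

The feature map lights up only in the column where the edge sits, and pooling halves the resolution while keeping that response.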
Lecture 22 (Apr 18):
- Regularization Techniques
- L2 Regularization
- Model Checkpoint
- Early Stopping
- Dropout (Original paper: https://jmlr.org/papers/volume15/srivastava14a/srivastava14a.pdf)
- Batch Normalization (Original paper: https://arxiv.org/abs/1502.03167)
- Different weight initialization schemes
- Data Augmentation
- Ensemble Methods (Bagging)
- Dimensionality Reduction
- Reasons for doing it include:
- Visualization
- Feature Selection
- Noise Reduction
- Cluster Analysis
- Compression
- Interpretability
- Anomaly Detection
- Manifold Learning
- etc...
- Some Methods for Dimensionality Reduction
- Principal Component Analysis (PCA)
- Autoencoders
- Ones we didn't have time to talk about include:
- t-SNE
- Linear Discriminant Analysis (LDA)
- Singular Value Decomposition (SVD)
- Non-negative Matrix Factorization (NMF)
- Note: All of these techniques (except Autoencoders) are basically just linear algebra ideas which can be adapted for dimensionality reduction
- Use of Autoencoders for De-noising
- Use of Autoencoders for Anomaly Detection
- Use of the Decoder from an Autoencoder for Generating new data
- This is our first real example of Generative AI
- Compared to thispersondoesnotexist.com (which uses a GAN not an Autoencoder)
- A more common approach is to use a variant of Autoencoders called a Variational Autoencoder (VAE).
- Learn more about VAEs here: https://www.youtube.com/watch?v=9zKuYvjFFS8&t=447s
Lecture 23 (Apr 23):
Clustering:
- K Means Clustering
- DBSCAN
Lecture 24 (Apr 25):
- Review of HW5 submitted models
- Effect of Model Scale
- Image Augmentation
- Improvement upon Fully Connected network (MLP) baseline
- Discussion of L1 and L2 normalization
- Brief discussion of Generative Adversarial Networks (GANs):
- Uses an NN as a generator of images, and then uses another NN trained to discriminate real from fake images as part of its loss function
- I recommend watching this video to learn some more advanced ideas about GANs: https://www.youtube.com/watch?v=dCKbRCUyop8&t=1067s
Tentative future schedule:
Lecture 25 (Apr 30):
- Bayes' Theorem
- Prior and Posterior Probabilities
- Using Bayes' Theorem to "update" prior beliefs and obtain "posterior" probabilities
- Example of a test for a disease
- Implications for Human Rationality: Bayesian Rationality movement (see: [lesswrong.com](http://lesswrong.com))
- The simple "odds formulation" of Bayes' Theorem:
- Posterior Odds = "Likelihood Ratio" x Prior Odds
- Naive Bayes Classifiers
- If we assume the features are independent of each other (given the class) we can treat them as separate pieces of evidence and use the math of multiple updates
- Example of a spam filter
- Gaussian Naive Bayes
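The disease-test example and the odds formulation can be checked against each other numerically. The prevalence, sensitivity, and false positive rate below are hypothetical numbers picked for illustration, as are the function names.

```python
def posterior(prior, sensitivity, false_positive_rate):
    """P(disease | positive test) via Bayes' Theorem."""
    evidence = sensitivity * prior + false_positive_rate * (1 - prior)
    return sensitivity * prior / evidence

def posterior_via_odds(prior, likelihood_ratios):
    """Posterior Odds = Likelihood Ratio x Prior Odds, applied once per piece of evidence."""
    odds = prior / (1 - prior)
    for lr in likelihood_ratios:
        odds *= lr
    return odds / (1 + odds)

prior = 0.01                  # hypothetical: 1% of people have the disease
sensitivity = 0.90            # hypothetical: P(positive | disease)
false_positive_rate = 0.05    # hypothetical: P(positive | no disease)
likelihood_ratio = sensitivity / false_positive_rate

one_test = posterior(prior, sensitivity, false_positive_rate)
one_test_odds = posterior_via_odds(prior, [likelihood_ratio])
two_tests_odds = posterior_via_odds(prior, [likelihood_ratio, likelihood_ratio])
print(one_test, one_test_odds, two_tests_odds)
```

With these numbers a single positive test moves the probability from 1% to only about 15%. The last line treats two positive results as independent pieces of evidence, which is the same assumption Naive Bayes makes about features.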
Lecture 26 (May 2):
- Gaussian Mixture Models
- Unsupervised Learning Wrap Up
- Natural Language Processing
Lecture 27 (May 7):
- Quiz 3
- LLMs
- How they work: Next Token Prediction
- Transformers
- Context size, temperature, and other ideas
- RLHF
- Prompting
- Foundation Models
Lecture 28 (May 9): (Last Lecture)
- How language models work:
- Word embeddings as trainable parameters
- We represent language as a sequence of vectors, one for each word (or technically for each token)
- Early "Neural Language Models" date back to 1991
- These used Recurrent Neural Networks (RNNs)
- In RNNs some neurons connect back upon themselves in successive time steps
- Each word has its embedding fed in to the network sequentially and they are trained to predict the next word in a process known as "Self Supervised Learning"
- We discussed the "Vanishing and Exploding Gradients" problem which effectively limits the length of sequence which the networks can process
- Attempts to improve upon RNNs include LSTMs and GRUs
- In 2017 a team at Google Brain invented the "Transformer" in their landmark paper: "Attention is All You Need"
- Transformers radically improve upon the previous state of the art, and at their heart is the "Attention Mechanism"
- In an "Attention Head" each word's embedding is mapped to three vectors:
- Query: Which essentially "asks a question" in the form of a vector
- Key: Which says "I am relevant to these questions" in the form of a vector
- Value: Which gives an "answer to the question" in the form of another vector
- We can explore how this works in an example: Imagine the sentence "A fuzzy creature roamed the verdant forest"
- One attention head may do something like this:
- All nouns produce a Query: "Are there any adjectives preceding me?"
- All adjectives produce a Key: "I am an adjective!"
- And all adjectives produce a Value: "Here is what I describe"
- Let's focus on what happens to the embedding of "creature" in this attention head:
- creature's Query should line up with fuzzy's Key, in the sense that they have a large dot product.
- We say "creature" will attend to "fuzzy", and it will add part of the Value of "fuzzy" to its own embedding, enriching its meaning.
- Attention is All You Need also introduced a method for "positional encoding" which augments the embedding vectors with information about where they appear in the sequence. In an RNN this information is contained in the order in which you feed the words in. In a Transformer the whole sequence is processed in parallel, so this information would otherwise be lost.
- This is needed for the operation we just described, as part of the query will "ask" about relative distance by including a linear transformation of its own positional encoding representing a certain shift, and the key will learn to answer with its own positional encoding.
- Transformers combine many of these attention heads in parallel to form a "multi-headed attention" block, and alternate those blocks with MLP layers and stack them a large number of times.
- In this fashion the network can learn complicated relationships which require comparisons of many different parts of the input sequence.
- Note that long-range dependencies are effortless for the transformer to model, as inside the attention head each position in the sequence is able to attend to any other position.
- This means however that the computational cost scales (quadratically, for standard attention) with the length of the sequence, which naturally limits the longest sequences our models can process; this limit is known as the "context size" of the model. Earlier GPT-3 models had context sizes of ~2k tokens (about half the length of this entire page) and thus would eventually forget the beginnings of long conversations. Pushing the models to larger scale has dramatically increased the context size of the state of the art: GPT-4 now supports up to 128k tokens, which is roughly 240 pages of text.
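The Query/Key/Value story for "fuzzy" and "creature" can be played out with a tiny scaled dot-product attention head. The two-dimensional vectors below are hand-picked toy values rather than learned ones, the example omits positional encoding and causal masking, and the function names are hypothetical.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attention_head(queries, keys, values):
    """Scaled dot-product attention: softmax(q . k / sqrt(d_k)) weighted sum of values."""
    d_k = len(keys[0])
    outputs, weights = [], []
    for q in queries:
        scores = [dot(q, k) / math.sqrt(d_k) for k in keys]
        w = softmax(scores)
        weights.append(w)
        outputs.append([sum(wi * v[j] for wi, v in zip(w, values))
                        for j in range(len(values[0]))])
    return outputs, weights

tokens = ["a", "fuzzy", "creature"]
# Toy vectors: only the noun asks a question, only the adjective answers it.
queries = [[0.0, 0.0], [0.0, 0.0], [4.0, 0.0]]
keys = [[0.0, 0.0], [4.0, 0.0], [0.0, 0.0]]
values = [[0.0, 0.0], [1.0, 0.0], [0.0, 0.0]]

outputs, weights = attention_head(queries, keys, values)
creature, fuzzy = tokens.index("creature"), tokens.index("fuzzy")
print("creature attends to:", dict(zip(tokens, weights[creature])))
print("update for creature:", outputs[creature])
```

Because creature's Query has a large dot product with fuzzy's Key, almost all of its attention weight goes to "fuzzy", and the head's output for "creature" is essentially fuzzy's Value. In a Transformer that output is then added to creature's embedding.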