- Email is the preferred method of communication. A class mailing list will be created as [email protected]. Announcements, however, will be made in the DingTalk group chat.
- Course slides: Intro | Regression | SVM/KNN/Tree | SVD/PCA/LDA | Hyperparameter | Neural Network | Graphical Model
- Project: Current | 2021 | 2019 | 2018 | 2017 | 2016
- Past years' exams: 2021 | 2019 (online take-home) | 2018 | 2017 | Exams from Tom Mitchell's ML course (Carnegie Mellon University)
No | Date | Contents |
---|---|---|
01 | 2.21 Tue | Course overview (Syllabus); Required software (Python, Github, PyCharm); Python crash course (Basic, Numpy (Notebook Shortcut Keys), Pandas. Also see Datacamp, CheatSheet) |
02 | 2.24 Fri | PML Ch. 1: Intro (Slides); Notations, Regression, Weight update (Slides) |
03 | 2.28 Tue | PML Ch. 2: Perceptron, Adaline, Gradient descent, Stochastic Gradient Descent |
04 | 3.03 Fri | PML Ch. 3: Logistic Regression (LR) (Slides) and Support Vector Machine (SVM) (Slides) |
05 | 3.07 Tue | PML Ch. 3: KNN (Slides, Code), Decision Tree (Slides). |
06 | 3.10 Fri | PML Ch. 4: Data Preprocessing, PML Ch. 5: SVD/PCA (Slides) |
07 | 3.14 Tue | PML Ch. 5: LDA (Slides), PML Ch. 6: Bias-Variance, Cross-validation (Slides) |
08 | 3.17 Fri | PML Ch. 6: Hyperparameter tuning, Evaluation Metric, Class imbalance (Slides) |
09 | 3.21 Tue | PML Ch. 7: Ensemble Learning (Slides), Kernel Method (Slides, PML Ch 3, 5) |
10 | 3.24 Fri | PML Ch. 8: Sentiment Analysis (Slides) |
11 | 3.28 Tue | Topics in Finance ML: Recession prediction (Slides), ML in Finance Research (Slides), Collaborative Filtering (Slides) |
12 | 3.31 Fri | Neural Network, Deep Learning, CNN (Slides, PML Ch. 12-15) |
13 | 4.04 Tue | Midterm Exam (Tentative) |
14 | 4.07 Fri | HSBC Guest Lecture [1/4] |
15 | 4.11 Tue | HSBC Guest Lecture [2/4] |
16 | 4.14 Fri | HSBC Guest Lecture [3/4] |
17 | 4.18 Tue | HSBC Guest Lecture [4/4] |
18 | 4.21 Fri | Course Project Presentation (may be scheduled later) |
-
- Register on Github.com and let the TA know your ID (by DingTalk). Make sure to use your full real name in your profile. Accept the invitation to the PHBS organization from the TA.
- Create a designated repository `GITHUB_ID/PHBS_MLF_2021` for your HW and project. Tick `Initialize this repository with a README` and select `python` under `.gitignore`
- Fork PML repository to your repository.
- Install Github Desktop. Then clone the PML repository to your local storage.
- Install the Anaconda Python distribution (3.X version, not 2.X version). Anaconda bundles core Python, useful scientific computation libraries (e.g., numpy, scipy, pandas), and a package management system (pip or conda).
- Install PyCharm Community version. (Or Professional version after applying for free student license)
- Send to TA the screenshots of (1) Github Desktop (showing the PML repository) (2) Jupyter Notebook (Anaconda) (3) PyCharm (See my examples: Github Desktop, Anaconda Spyder).
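Before taking the screenshots, it may help to confirm that the Anaconda installation described above is the one Python actually uses. The snippet below is only a minimal sketch: the helper name `check_environment` is made up for illustration, and the package list simply mirrors the libraries named in the Anaconda step.

```python
import importlib
import sys


def check_environment(packages=("numpy", "scipy", "pandas")):
    """Return {package: version string, or None if it cannot be imported}."""
    versions = {}
    for name in packages:
        try:
            module = importlib.import_module(name)
            versions[name] = getattr(module, "__version__", "unknown")
        except ImportError:
            versions[name] = None
    return versions


if __name__ == "__main__":
    # The course asks for Python 3.X, not 2.X.
    print("Python", sys.version.split()[0])
    for name, version in check_environment().items():
        print(f"{name}: {version if version else 'NOT INSTALLED'}")
```

Run it from the Anaconda prompt, a Jupyter Notebook cell, or PyCharm. If a package shows as not installed, the interpreter in use is probably not the Anaconda one.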
-
- The goal of this HW is to become familiar with the `pandas` package and dataframes. Due to limited time, I cannot cover pandas in class, so you need to teach yourself. Remember that there are many ways to do the tasks below. Use your own way.
- For this HW, we will use the Polish companies bankruptcy data Data Set from the UCI Machine Learning Repository. Download the dataset and put the 4th year file (`4year.arff`) in your `YOUR_GITHUB_ID/PHBS_MLF_2021/data/`
- I did some basic processing of the data (loading it into a dataframe and creating the `bankruptcy` column). See my github
- We are going to use the following 4 features: `X1 net profit / total assets`, `X2 total liabilities / total assets`, `X7 EBIT / total assets`, `X10 equity / total assets`, and `class`
- Create a new dataframe with only the 4 features (and `Bankruptcy`). Properly rename the columns to `X1`, `X2`, `X7`, and `X10`
- Fill in the missing values (`nan`) with the column means. (Use `DataFrame.fillna()` or see Ch 4 of `PML`)
- Find the mean and std of the 4 features among all, bankrupt, and still-operating companies (3 groups).
- How many companies satisfy the condition `X1 < mean(X1) - std(X1)` AND `X10 < mean(X10) - std(X10)`?
- What is the ratio of the bankrupted companies among the sub-groups above?
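The pandas operations named in the steps above (mean imputation, group-wise statistics, boolean filtering) can be sketched on made-up data. This is one possible approach, not the expected answer: the dataframe is random and is NOT the UCI data, and the column names `X1`, `X2`, `X7`, `X10`, and `Bankruptcy` are assumed to match the renamed columns from the task.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the homework dataframe (random numbers, not the UCI data).
rng = np.random.default_rng(0)
n = 200
features = ["X1", "X2", "X7", "X10"]
df = pd.DataFrame(rng.normal(size=(n, 4)), columns=features)
df["Bankruptcy"] = rng.integers(0, 2, size=n)
df.loc[[3, 10, 50], "X1"] = np.nan   # inject a few missing values
df.loc[[7, 99], "X10"] = np.nan

# Fill missing values with each column's mean (fillna aligns the means by column label).
df[features] = df[features].fillna(df[features].mean())

# Mean and std over all rows, then separately per Bankruptcy group.
stats_all = df[features].agg(["mean", "std"])
stats_by_group = df.groupby("Bankruptcy")[features].agg(["mean", "std"])

# Rows where both X1 and X10 lie more than one std below their means.
mask = (df["X1"] < df["X1"].mean() - df["X1"].std()) & (
    df["X10"] < df["X10"].mean() - df["X10"].std()
)
n_selected = int(mask.sum())
bankrupt_ratio = df.loc[mask, "Bankruptcy"].mean() if n_selected else float("nan")

print(stats_all)
print(stats_by_group)
print(n_selected, bankrupt_ratio)
```

Here the means and standard deviations are computed after imputation. Computing them before filling the missing values is an equally defensible choice and gives slightly different numbers.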
-
- The goal of this HW is to become familiar with the basic classifiers in PML Ch 3.
- For this HW, we continue to use the Polish companies bankruptcy data Data Set from the UCI Machine Learning Repository. Download the dataset and put the 4th year file (`4year.arff`) in your `YOUR_GITHUB_ID/PHBS_MLF_2021/HW2/`
- I did some basic processing of the data (loading it into a dataframe, creating the `bankruptcy` column, changing column names, filling in `na` values, training-vs-test split, standardization, etc). See my github
- Select the 2 most important features using LogisticRegression with L1 penalty. (Adjust C until you see 2 features)
- Using the 2 selected features, apply LR / SVM / decision tree. Try your own hyperparameters (C, gamma, tree depth, etc) to maximize the prediction accuracy. (Just try several values. You don't need to show your answer is the maximum.)
- Visualize your classifiers using the `plot_decision_regions` function from PML Ch. 3
- Put your result in `YOUR_GITHUB_ID/PHBS_MLF_2021/HW2/Classifiers.ipynb`
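The "adjust C until you see 2 features" step relies on the L1 penalty driving more weights to exactly zero as C shrinks. The sketch below illustrates the idea on synthetic data from `make_classification`, not the bankruptcy data. The helper names `nonzero_features` and `find_c_for_k_features` and the grid of C values are assumptions made for this example.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Synthetic, standardized data standing in for the preprocessed training set.
X, y = make_classification(
    n_samples=500, n_features=10, n_informative=3, n_redundant=0, random_state=0
)
X = StandardScaler().fit_transform(X)


def nonzero_features(X, y, C):
    """Fit L1-penalized logistic regression; return indices of nonzero weights."""
    clf = LogisticRegression(penalty="l1", C=C, solver="liblinear", random_state=0)
    clf.fit(X, y)
    return np.flatnonzero(clf.coef_[0])


def find_c_for_k_features(X, y, k, c_grid):
    """Return (C, indices) for the first C that leaves exactly k features, else None."""
    for C in c_grid:
        idx = nonzero_features(X, y, C)
        if len(idx) == k:
            return C, idx
    return None


# Scan from strong regularization (small C) to weak regularization (large C).
c_grid = np.logspace(-4, 1, 60)
result = find_c_for_k_features(X, y, k=2, c_grid=c_grid)
print(result)
```

A grid scan can skip over the value of k you want, in which case the function returns `None` and a finer grid around the transition is needed. Adjusting C by hand, as the task suggests, works just as well.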
-
- The goal of this HW is to be familiar with PCA (feature extraction), grid search, pipeline, etc.
- For this HW, we continue to use the Polish companies bankruptcy data Data Set from the UCI Machine Learning Repository. Download the dataset and put the 4th year file (`4year.arff`) in your `YOUR_GITHUB_ID/PHBS_MLF_2021/HW3/`
- Use the same pre-processing provided in Set 2 (loading into a dataframe, creating the `bankruptcy` column, changing column names, filling in `na` values, training-vs-test split, standardization, etc). See my github
- Extract 3 features using the PCA method.
- Using the selected features from above, we are going to apply LR / SVM / decision tree.
- Implement the methods using pipeline. (PML p185)
- Use grid search for finding optimal hyperparameters. (PML p199). In the search, apply 5-fold cross-validation.
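The pipeline and grid-search steps above fit together roughly as follows. This is a sketch under stated assumptions rather than the reference solution: the data comes from `make_classification`, only logistic regression is shown (SVM and decision tree would slot into the same `clf` step), and the step names and C grid are arbitrary choices for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data standing in for the bankruptcy features and labels.
X, y = make_classification(
    n_samples=400, n_features=12, n_informative=5, random_state=1
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=1
)

# Standardize -> extract 3 principal components -> classify.
pipe = Pipeline([
    ("scale", StandardScaler()),
    ("pca", PCA(n_components=3)),
    ("clf", LogisticRegression(max_iter=1000)),
])

# Hyperparameters are addressed as <step name>__<parameter name>.
param_grid = {"clf__C": [0.01, 0.1, 1.0, 10.0]}

# 5-fold cross-validation on the training set only.
gs = GridSearchCV(pipe, param_grid, cv=5, scoring="accuracy")
gs.fit(X_train, y_train)

print(gs.best_params_, gs.best_score_)
print("test accuracy:", gs.score(X_test, y_test))
```

Because scaling and PCA sit inside the pipeline, they are refit on the training part of every fold, so the held-out fold never leaks into the preprocessing.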
- Lectures: Tuesday & Friday 1:30 – 3:20 PM
- Venue: PHBS Building, Room 229
Instructor: Jaehyuk Choi
- Office: PHBS Building, Room 755
- Phone: 86-755-2603-0568
- Email: [email protected]
- Office Hour: Monday 7-9 PM
- Email: [email protected]
- TA Office Hour (Room 213/214): TBA
With the advent of computation power and big data, machine learning (ML) has recently become one of the most spotlighted research fields in industry and academia. This course provides a broad introduction to ML from theoretical and practical perspectives. Through this course, students will learn the intuition and implementation behind the popular ML methods and gain hands-on experience using ML software packages such as scikit-learn and TensorFlow. This course will also explore the possibility of applying ML to finance and business. Each student is required to complete a final course project. This year, the compliance analytics team at HSBC bank (Guangzhou) will give 4 guest lectures to demonstrate how ML is developed and shared in the banking industry.
This course assumes prior knowledge of probability/statistics and experience in Python. It is recommended for those who have taken introductory ML/AI courses in an undergraduate program.
- PML (primary textbook): Python Machine Learning 3rd Ed. by Sebastian Raschka.
- Github (PHBS fork)
- ISLR: An Introduction to Statistical Learning (with Applications in R) by James, Witten, Hastie, and Tibshirani
- Python Implementation: PHBS/ISLR-python (PHBS fork)
- Bishop: Pattern Recognition and Machine Learning by Bishop (Microsoft)
- ESL: The Elements of Statistical Learning by Hastie, Tibshirani, and Friedman
- CML: Coursera Machine Learning by Andrew Ng
- DL: Deep Learning by Goodfellow, Bengio, and Courville
- AFML: Advances in Financial Machine Learning by López de Prado
- Attendance 20%, Mid-term exam 30%, Assignments 20%, Course Project 30%
- Attendance: TBA. Randomly checked. The score is calculated as `20 – 2x(#of absence)`. Leave requests should be made 24 hours in advance with supporting documents, except for emergencies. Job interviews/internships are not valid reasons for leave.
- Mid-term exam: 11.1 Mon. In-class open-book without computer/phone/calculator
- Course project: Data Proposal and Presentation. Group of up to ?? people.
- Grade in letters (e.g., A+, A-, ... ,D+, D, F). A- or above < 30% and B- or below > 10%.