Threat model for machine learning projects #3

ttimbers · 2024-07-12T20:12:53Z

Here I want to brainstorm a list of all the potential threats (i.e., where things can go wrong) to a machine learning project. Our checklist need not address all of them, but in our literature review we should describe them all and identify which ones our checklist covers. Here's my starting list:

Mismatch of machine learning model choice with respect to the data used for training and evaluation (e.g., linear regression for binomial data)
Data quality issues (e.g., missing data, duplicate data, data anomalies, etc.)
Errors in code (e.g., bug in code that leads to data labels being shifted by one, or misnaming a file being written to disk)
Data leakage between training and test set, leading to overly optimistic performance estimates (e.g., test data being used to fit a pre-processing object)
Model stability issues (e.g., a different train-validation split leads to a large change in the model)
Model behaviour/learning issues (e.g., model learns a shortcut to predictions that can lead to erroneous predictions in certain cases)
Bias/fairness issues (e.g., model makes different predictions for particular subgroups of observations)
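To make the data leakage item concrete, here is a minimal sketch using only the Python standard library. The toy feature values and the helper names `fit_scaler` and `transform` are made up for illustration; the point is only that pre-processing parameters should be estimated from the training split and then reused on the test split.

```python
from statistics import mean, pstdev


def fit_scaler(values):
    """Learn centering/scaling parameters from the given values only."""
    mu = mean(values)
    sigma = pstdev(values) or 1.0  # fall back to 1.0 for a constant feature
    return mu, sigma


def transform(values, params):
    """Standardize values with previously learned parameters."""
    mu, sigma = params
    return [(v - mu) / sigma for v in values]


# Hypothetical toy feature, already split into train and test.
train = [1.0, 2.0, 3.0, 4.0]
test = [10.0, 12.0]

# Leaky: the parameters are estimated from train and test together,
# so information about the test set influences the pre-processing.
leaky_params = fit_scaler(train + test)

# Safe: the parameters come from the training split alone and are reused as-is.
safe_params = fit_scaler(train)
train_scaled = transform(train, safe_params)
test_scaled = transform(test, safe_params)
```

A checklist item could ask whether every fitted pre-processing step is created from the training split only, as in the "safe" branch above.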

ttimbers · 2024-07-12T22:18:41Z

Reproducibility issues (e.g., model training and prediction outputs are different on different computers or operating systems)
Data drift (e.g., model performance decreases over time in production and the new data appears to be coming from a different distribution than the test data)
Deployment issues (e.g., bug in code that creates an error in the API)
Communication of results/predictions issues and/or user interface issues (this itself is very very broad... and may have to be split out)
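As one possible way to check for the data drift item, a two-sample Kolmogorov-Smirnov statistic compares the empirical distribution of a feature at training time with the one seen in production. This is a rough standard-library sketch; the function name `ks_statistic`, the sample values, and the 0.5 alert threshold are assumptions for illustration, not a recommended setting.

```python
from bisect import bisect_right


def ks_statistic(reference, current):
    """Two-sample KS statistic: largest gap between the two empirical CDFs."""
    ref = sorted(reference)
    cur = sorted(current)
    gap = 0.0
    # Both empirical CDFs are step functions, so the largest gap
    # is reached at one of the observed data points.
    for x in ref + cur:
        cdf_ref = bisect_right(ref, x) / len(ref)
        cdf_cur = bisect_right(cur, x) / len(cur)
        gap = max(gap, abs(cdf_ref - cdf_cur))
    return gap


# Hypothetical feature values at training time and in production.
reference_sample = [0.1, 0.2, 0.3, 0.4, 0.5]
production_sample = [0.6, 0.7, 0.8, 0.9, 1.0]

DRIFT_THRESHOLD = 0.5  # assumed value, would need tuning per feature
drift_detected = ks_statistic(reference_sample, production_sample) > DRIFT_THRESHOLD
```

In practice one would likely also compute a p-value or use an existing statistics library; the sketch only shows the idea of comparing distributions rather than waiting for performance to degrade.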

H234J · 2024-07-14T17:28:03Z

A few more important potential threats:

Poor hyperparameter tuning (e.g., selecting the wrong learning rate, a gamma value that is too small or too large in an SVM, a poorly chosen tree depth in decision trees, or a poorly chosen number of estimators (n_estimators) in a Random Forest)
Skewed classes in the training set (the model can end up learning the majority class well and the minority class poorly)
Model scalability (e.g., prediction latency increases when the inflow velocity is high)
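For the skewed classes item, one common mitigation is to weight each class inversely to its frequency. Below is a small sketch of that heuristic, `n_samples / (n_classes * class_count)`; the function name and the 90/10 toy labels are made up for illustration.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Weight each class inversely to its frequency in the training labels."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * count) for cls, count in counts.items()}


# Hypothetical training labels: 90 majority-class and 10 minority-class rows.
labels = [0] * 90 + [1] * 10
weights = balanced_class_weights(labels)
# The minority class gets a weight of 5.0, the majority class about 0.56.
```

Whether re-weighting, resampling, or neither is appropriate depends on the project, so a checklist would probably only ask that the class balance is inspected and the choice is documented.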

ttimbers · 2024-07-16T20:06:25Z

Thank you for these suggestions @H234J!

JohnShiuMK · 2024-07-18T21:13:40Z

I am not sure if this is part of the reproducibility issue or should be listed separately:

Inadequate control or specification of dependency versions

Previously, I encountered situations where a project failed to run, or the model produced different outputs, after upgrades of underlying dependencies.
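One way such a check could look is a small test that compares installed package versions against the pinned ones. This is only a sketch: the function name `find_version_mismatches` and the `get_version` parameter are made up here, and the pins would come from whatever lock or requirements file the project uses.

```python
from importlib import metadata


def find_version_mismatches(pins, get_version=metadata.version):
    """Return {package: (expected, installed)} for every pin that is not met.

    `installed` is None when the package is not installed at all.
    """
    mismatches = {}
    for package, expected in pins.items():
        try:
            installed = get_version(package)
        except metadata.PackageNotFoundError:
            installed = None
        if installed != expected:
            mismatches[package] = (expected, installed)
    return mismatches
```

Calling `find_version_mismatches({"some-package": "1.2.3"})` (a hypothetical pin) at the start of a pipeline or in a test would surface silent upgrades before they change model outputs.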

JohnShiuMK · 2024-07-18T21:19:27Z

Another potential mistake:

Small dataset but complicated model (e.g., the training dataset is too small, and/or the model is too complex for the amount of data available)

tonyshumlh · 2024-07-27T16:46:13Z

Extension of "Mismatch of machine learning model choice with respect to the data used for training and evaluation": improper evaluation metrics, e.g., using accuracy for a very imbalanced dataset, or in scenarios where false positives/negatives have serious consequences.
Model interpretation issues, similar to model behaviour/learning issues: the model's predictions put too much weight on attributes where a human expects little or no effect, or vice versa.
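A quick illustration of the evaluation metric point, using made-up labels: a classifier that always predicts the majority class scores high on accuracy while never finding a single positive case.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def recall(y_true, y_pred, positive=1):
    """Fraction of actual positives that were predicted as positive."""
    true_pos = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    actual_pos = sum(t == positive for t in y_true)
    return true_pos / actual_pos if actual_pos else 0.0


# Hypothetical imbalanced dataset: 95 negatives, 5 positives.
y_true = [0] * 95 + [1] * 5
# A degenerate model that always predicts the majority class.
y_pred = [0] * 100

acc = accuracy(y_true, y_pred)  # 0.95
rec = recall(y_true, y_pred)    # 0.0
```

Here accuracy alone would hide the fact that every positive case is missed, which is exactly the situation where false negatives may carry serious consequences.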

ttimbers · 2024-10-08T17:22:48Z

Another one that we need to add is the E from ETL (extract, transform, and load). I think we have the T and L covered, but not the E.
