Anomaly Detection (also known as Outlier Detection) is the process of recognizing objects that deviate from normal expectations. Anomalies differ from noise: noise is data that lies far from the mean or median of a distribution, whereas an anomaly is generated by a different process than the one that produced the rest of the data. When huge volumes of data need to be processed in near real time to gain insight, stream processing is the best answer. Analyzing this data can provide valuable insight for future actions.
This repository contains a machine-learning-based real-time network anomaly detection project built with Spark Streaming. Decision Tree, Random Forest, Gradient-Boosted Tree, Naive Bayes, and Logistic Regression algorithms were used for supervised learning; the K-Means clustering algorithm was used for unsupervised learning.
Dataset: The NSL-KDD dataset, an improved version of KDD'99, was used for the experiments. It has 41 attributes, and the 42nd is the class label, assigned as either normal or anomaly. Principal Component Analysis (PCA) was used for feature reduction.
- Decision Tree
- Random Forest
- Gradient-Boosted Tree
- Naive Bayes
- Logistic Regression
- K-Means
- Principal Component Analysis (PCA)
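The PCA step can be sketched in a few lines of NumPy. This is a minimal illustration of reducing a feature matrix (such as the 41 NSL-KDD attributes) to fewer components; the function name and toy data are illustrative, not code from this project:

```python
import numpy as np

def pca_reduce(X, n_components):
    """Project feature matrix X onto its top principal components."""
    # Center the data so each feature has zero mean.
    Xc = X - X.mean(axis=0)
    # Eigen-decompose the covariance matrix of the centered features.
    eigvals, eigvecs = np.linalg.eigh(np.cov(Xc, rowvar=False))
    # eigh returns ascending eigenvalues; keep the largest n_components.
    order = np.argsort(eigvals)[::-1][:n_components]
    return Xc @ eigvecs[:, order]

# Toy example: reduce 5 features to 2 principal components.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
reduced = pca_reduce(X, 2)
print(reduced.shape)  # (100, 2)
```

In the actual project, Spark MLlib's `PCA` transformer would play this role on the streaming feature vectors.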
In supervised learning, the data is first prepared for the ML algorithms. Second, a model object is created from one of the supervised algorithms and trained on batch data. The streaming process then works with this trained model: when new data arrives, it is parsed and feature reduction is applied. The data is then classified with the trained supervised model using a sliding window. Finally, the results are printed to the console as a confusion matrix.
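The supervised pipeline above can be sketched in plain Python. The stream source, the trivial threshold "model", and the window size are illustrative stand-ins for the Spark Streaming and MLlib equivalents:

```python
from collections import Counter, deque

def confusion_matrix(pairs):
    """Count (actual, predicted) label pairs, e.g. ('normal', 'anomaly')."""
    return Counter(pairs)

# Hypothetical trained model: a trivial threshold rule stands in for a
# Decision Tree / Random Forest / etc. trained on batch data.
def predict(record):
    return "anomaly" if record["bytes"] > 1000 else "normal"

window = deque(maxlen=3)  # sliding window over incoming records

stream = [
    {"bytes": 120, "label": "normal"},
    {"bytes": 5000, "label": "anomaly"},
    {"bytes": 800, "label": "anomaly"},  # missed by the toy rule
]

results = []
for record in stream:
    window.append(record)  # newest record slides in, oldest drops out
    results.append((record["label"], predict(record)))

print(confusion_matrix(results))
```

Each (actual, predicted) count corresponds to one cell of the confusion matrix printed to the console in the project.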
In unsupervised learning, the data is first prepared for the K-Means clustering algorithm. Second, candidate k values are chosen and, after clustering, the Silhouette score is computed to select the best k. For every record, the distance to its cluster center is calculated, and the maximum distance among normal values is recorded for every cluster. In the streaming process, when new data arrives, its cluster is predicted with the trained model and its distance is calculated. A sliding window is used to process data in 3-second windows with a 1-second slide. The new distance is then compared with the cluster's maximum distance: if it is larger, the system flags the record as an anomaly; otherwise it is labeled normal. Finally, the results are printed to the console as a confusion matrix.
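The distance-threshold rule can be illustrated with a small stdlib-only sketch. The cluster centers and per-cluster thresholds here are toy values; in the project they come from the trained K-Means model and the distances of normal training records:

```python
import math

# Hypothetical cluster centers and, per cluster, the maximum distance
# observed among normal training records (the anomaly threshold).
centers = {0: (0.0, 0.0), 1: (10.0, 10.0)}
max_normal_dist = {0: 2.0, 1: 1.5}

def nearest_cluster(point):
    """Predict the cluster as the nearest center, as K-Means does."""
    return min(centers, key=lambda c: math.dist(point, centers[c]))

def classify(point):
    """Flag a point as an anomaly if it lies farther from its cluster
    center than any normal training point did."""
    c = nearest_cluster(point)
    dist = math.dist(point, centers[c])
    return "anomaly" if dist > max_normal_dist[c] else "normal"

print(classify((0.5, 0.5)))    # close to cluster 0 -> normal
print(classify((10.0, 14.0)))  # 4.0 from cluster 1 > 1.5 -> anomaly
```

In the streaming version, this check runs over each 3-second window of arriving records rather than over single points.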
Gradient-Boosted Tree Streaming Results
K-Means Streaming Results
Logistic Regression Streaming Results
Keywords: Spark Streaming, Real-Time, Network, Anomaly Detection, Machine Learning, Supervised Learning, Unsupervised Learning, Classification, Clustering