The tool can do the following:
- Output the most frequently occurring words and their frequencies, according to user preference
- Extract features by processing and simplifying the raw data
- Train the naive Bayes classifier
- Output the ham/spam confusion matrix
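
The four capabilities above can be sketched end to end in a few lines. This is only a minimal illustration using the Python standard library, not the project's actual code: the function names, the toy emails, the label encoding (0 = ham, 1 = spam), and the hand-written multinomial naive Bayes with Laplace smoothing are all assumptions made for the example.

```python
from collections import Counter
import math

def most_common_words(emails, n):
    """Return the n most frequent words across all emails as (word, count) pairs."""
    counts = Counter()
    for text in emails:
        # Assumption: keep purely alphabetic, lowercased tokens only.
        counts.update(w for w in text.lower().split() if w.isalpha())
    return counts.most_common(n)

def extract_features(emails, vocabulary):
    """Build one row of word counts per email (a 2-D matrix as a list of lists)."""
    index = {word: i for i, word in enumerate(vocabulary)}
    matrix = []
    for text in emails:
        row = [0] * len(vocabulary)
        for w in text.lower().split():
            if w in index:
                row[index[w]] += 1
        matrix.append(row)
    return matrix

def train_naive_bayes(matrix, labels):
    """Multinomial naive Bayes with Laplace smoothing.

    Assumes labels are 0 (ham) or 1 (spam) and both classes are present.
    """
    n_features = len(matrix[0])
    model = {}
    for c in (0, 1):
        rows = [r for r, y in zip(matrix, labels) if y == c]
        totals = [sum(col) for col in zip(*rows)]
        denom = sum(totals) + n_features
        log_prior = math.log(len(rows) / len(labels))
        log_probs = [math.log((t + 1) / denom) for t in totals]
        model[c] = (log_prior, log_probs)
    return model

def predict(model, row):
    """Return the class with the highest log-posterior score for one feature row."""
    scores = {
        c: prior + sum(x * lp for x, lp in zip(row, log_probs))
        for c, (prior, log_probs) in model.items()
    }
    return max(scores, key=scores.get)

def confusion_matrix(y_true, y_pred):
    """Rows are true labels, columns are predicted labels, in (ham, spam) order."""
    cm = [[0, 0], [0, 0]]
    for t, p in zip(y_true, y_pred):
        cm[t][p] += 1
    return cm

# Toy data, invented purely for illustration.
train_emails = [
    "win free money now",
    "free prize claim now",
    "meeting schedule for monday",
    "project report attached",
]
train_labels = [1, 1, 0, 0]
test_emails = ["free money prize", "monday meeting schedule"]
test_labels = [1, 0]

vocabulary = [word for word, _ in most_common_words(train_emails, 10)]
model = train_naive_bayes(extract_features(train_emails, vocabulary), train_labels)
predictions = [predict(model, row) for row in extract_features(test_emails, vocabulary)]
print(confusion_matrix(test_labels, predictions))
```

A library classifier and confusion-matrix helper would normally replace the hand-written functions; they are spelled out here only to make each step visible.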
Because the machine used had limited capabilities, the training and test sets were a small portion of a much larger dataset. The filter was optimized for the smaller dataset, but it can also run on the larger one, provided that the correct number of files is given for each label vector and the spam emails in that vector are identified.
Link to the smaller Test-Train dataset used
Link to the [whole 50MB dataset]
- Part 1 - Most Common Words Extraction
- Part 2 - Feature Extraction
- Part 3 - Extracting Labeled Feature Vector per Training Email to One Single Two-Dimensional Matrix
- Part 4 - Defining and Training Naive Bayes Classifier
- Part 5 - Testing the Trained Model using the Test Set Defined