A simple yet effective C-based spam email classifier using a Naive Bayes approach.
- Overview
- Features
- How It Works
- File Structure
- Getting Started
- Usage
- Performance
- Future Scope and Improvements
This project implements a basic spam email classifier written entirely in C. It uses a Naive Bayes algorithm to categorize emails as spam or not spam based on the words they contain. The classifier is trained on a dataset of labeled emails, from which it learns the probability of each word appearing in spam and non-spam emails. It then uses these probabilities to predict the class of new emails. The implementation is lightweight and fast, making it suitable for small to medium-sized datasets.
- Students new to programming: This project can be a good starting point for students who are new to programming and want to learn about text classification algorithms.
- C programming enthusiasts: For those who want to explore the capabilities of C, this project provides a practical example of implementing a machine learning algorithm in C.
- Machine learning beginners: If you are new to machine learning and want to understand the basics of text classification, this project can help you grasp the concepts behind the Naive Bayes algorithm.
- Educators and trainers: Teachers and trainers can use this project to demonstrate the implementation of a simple machine learning algorithm in C to their students.
- Train on a dataset of labeled emails
- Predict whether new emails are spam or not
- Simple and lightweight implementation in C
- Fast execution with runtime measurement
- Model saving and loading functionality
- Data Loading: Emails are loaded from a text file using the data loader.
- Training: The classifier learns from a set of pre-labeled emails, counting the occurrences of words in spam and non-spam emails.
- Tokenization: Emails are broken down into individual words (tokens).
- Probability Calculation: For each word, the probability of it appearing in spam and non-spam emails is calculated using Laplace smoothing.
- Prediction: New emails are classified by calculating the overall probability of being spam or not spam based on the words they contain.
- Model Persistence: The trained model can be saved to a file and loaded later for predictions without retraining.
- main.c: Main program file containing the entry point and command-line interface
- spam_classifier.h: Header file with function declarations and constants for the spam classifier
- spam_classifier_impl.c: Implementation of the spam classifier functions
- data_loader.c and data_loader.h: Functions for loading email data from files
- model_io.c and model_io.h: Functions for saving and loading the trained model
- GCC compiler
To compile the project, you can use the following command in the project directory:
./run_project.sh
To train the model and test it on a dataset:
./test_output
This will load the email data, train the model, test it, and save the model to a file.
To use the trained model for predicting on new emails:
./test_output --predict
This will load the saved model and allow you to input emails for classification.
The classifier's performance can be evaluated based on:
- Accuracy: Printed at the end of the test phase, showing the percentage of correctly classified emails.
- Execution Time: Displayed in milliseconds, showing the total time taken for training and testing.
Note: Performance may vary depending on the size and quality of the training dataset, as well as the characteristics of the emails being classified. The current dataset was created by me and is not a real-world dataset.
- Implement more advanced text processing techniques like TF-IDF, stemming, and stop-word removal.
- Experiment with different probability estimation methods and feature selection techniques like chi-square.
- Add more evaluation metrics like precision, recall, and F1 score.
- Generalize the classifier to handle multiple classes and improve the model persistence functionality.