C-based Spam Email Classifier

A simple yet effective C-based spam email classifier using a Naive Bayes approach.

Overview

This project implements a basic spam email classifier written entirely in C. It uses a Naive Bayes algorithm to categorize emails as spam or not spam based on the words they contain. The classifier is trained on a dataset of labeled emails, from which it learns the probability of each word appearing in spam and non-spam emails, and then uses those probabilities to predict the class of new emails. The implementation is lightweight and fast, making it suitable for small to medium-sized datasets.

Target Audience

Students new to Programming: This project can be a good starting point for students who are new to programming and want to learn about text classification algorithms.
C Programming Enthusiasts: For those who want to explore the capabilities of C programming, this project provides a practical example of implementing a machine learning algorithm in C.
Machine Learning Beginners: If you are new to machine learning and want to understand the basics of text classification, this project can help you grasp the concepts behind the Naive Bayes algorithm.
Educators and Trainers: Teachers and trainers can use this project to demonstrate the implementation of a simple machine learning algorithm in C to their students.

Features

Train on a dataset of labeled emails
Predict whether new emails are spam or not
Simple and lightweight implementation in C
Fast execution with runtime measurement
Model saving and loading functionality

How It Works

If you like to read, here is the explanation: :)

Data Loading: Emails are loaded from a text file using the data loader.
Tokenization: Emails are broken down into individual words (tokens).
Training: The classifier learns from a set of pre-labeled emails, counting the occurrences of each token in spam and non-spam emails.
Probability Calculation: For each word, the probability of it appearing in spam and non-spam emails is calculated using Laplace smoothing.
Prediction: New emails are classified by calculating the overall probability of being spam or not spam based on the words they contain.
Model Persistence: The trained model can be saved to a file and loaded later for predictions without retraining.
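
The tokenization, training, probability, and prediction steps above can be sketched as follows. This is a minimal illustration and not the project's actual code: the `Model` struct, the limits `MAX_VOCAB` and `MAX_WORD_LEN`, and the functions `next_token`, `train_email`, `smoothed_prob`, and `classify_email` are all hypothetical names chosen for this example. It also assumes add-one (Laplace) smoothing over the vocabulary size and compares the classes through a running odds ratio; the real implementation may combine probabilities differently.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <string.h>

#define MAX_VOCAB 64     /* hypothetical limits for this sketch */
#define MAX_WORD_LEN 32

/* Hypothetical model layout: per-word counts for each class. */
typedef struct {
    char words[MAX_VOCAB][MAX_WORD_LEN];
    int spam_count[MAX_VOCAB];
    int ham_count[MAX_VOCAB];
    int vocab_size;
    int spam_total, ham_total;    /* tokens seen per class */
    int spam_emails, ham_emails;  /* emails seen per class */
} Model;

/* Copy the next lowercase alphanumeric token into out; return 0 at end of text. */
static int next_token(const char **cursor, char *out, size_t cap) {
    const char *p = *cursor;
    size_t n = 0;
    while (*p && !isalnum((unsigned char)*p)) p++;  /* skip separators */
    while (*p && isalnum((unsigned char)*p)) {
        if (n + 1 < cap) out[n++] = (char)tolower((unsigned char)*p);
        p++;  /* characters beyond cap - 1 are dropped */
    }
    out[n] = '\0';
    *cursor = p;
    return n > 0;
}

static int find_word(const Model *m, const char *w) {
    for (int i = 0; i < m->vocab_size; i++)
        if (strcmp(m->words[i], w) == 0) return i;
    return -1;
}

/* Training: count each token of one labeled email. Model must start zeroed. */
void train_email(Model *m, const char *text, int label_spam) {
    char tok[MAX_WORD_LEN];
    if (label_spam) m->spam_emails++; else m->ham_emails++;
    while (next_token(&text, tok, sizeof tok)) {
        int i = find_word(m, tok);
        if (i < 0) {
            if (m->vocab_size == MAX_VOCAB) continue;  /* vocabulary full */
            i = m->vocab_size++;
            strcpy(m->words[i], tok);
        }
        if (label_spam) { m->spam_count[i]++; m->spam_total++; }
        else            { m->ham_count[i]++;  m->ham_total++; }
    }
}

/* Laplace (add-one) smoothing: no word ever gets probability zero. */
double smoothed_prob(int count, int class_total, int vocab_size) {
    assert(vocab_size > 0);
    return (count + 1.0) / (class_total + vocab_size);
}

/* Prediction: returns 1 for spam, 0 for not spam. */
int classify_email(const Model *m, const char *text) {
    char tok[MAX_WORD_LEN];
    assert(m->spam_emails > 0 && m->ham_emails > 0);
    /* Start from the prior odds P(spam) / P(not spam). */
    double odds = (double)m->spam_emails / (double)m->ham_emails;
    while (next_token(&text, tok, sizeof tok)) {
        int i = find_word(m, tok);
        if (i < 0) continue;  /* words never seen in training are ignored */
        odds *= smoothed_prob(m->spam_count[i], m->spam_total, m->vocab_size) /
                smoothed_prob(m->ham_count[i], m->ham_total, m->vocab_size);
    }
    return odds > 1.0;
}
```

With a zero-initialized `Model`, calling `train_email` once per labeled email and then `classify_email` on new text reproduces the flow described above. For long emails, summing logarithms of the probabilities is a common and more numerically robust alternative to multiplying ratios.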

File Structure

main.c: Main program file containing the entry point and command-line interface
spam_classifier.h: Header file with function declarations and constants for the spam classifier
spam_classifier.c: Implementation of the spam classifier functions
data_loader.c and data_loader.h: Functions for loading email data from files
model_io.c and model_io.h: Functions for saving and loading the trained model
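
As a rough idea of what saving and loading a model can look like, here is a minimal sketch. It is not the contents of model_io.c: the `Model` struct, the `MODEL_MAGIC` tag, and the functions `save_model` and `load_model` are hypothetical, and the sketch assumes the model is a fixed-size struct written as raw bytes, which is only portable between builds that share the same compiler and platform.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

#define MODEL_MAGIC 0x53504D31u  /* hypothetical file tag ("SPM1") */
#define MAX_VOCAB 64             /* hypothetical limits for this sketch */
#define MAX_WORD_LEN 32

/* Hypothetical fixed-size model, so it can be written in one call. */
typedef struct {
    char words[MAX_VOCAB][MAX_WORD_LEN];
    int spam_count[MAX_VOCAB];
    int ham_count[MAX_VOCAB];
    int vocab_size;
} Model;

/* Write a tag followed by the raw struct. Returns 0 on success, -1 on error. */
int save_model(const Model *m, const char *path) {
    assert(m != NULL && path != NULL);
    FILE *f = fopen(path, "wb");
    if (!f) return -1;
    unsigned magic = MODEL_MAGIC;
    int ok = fwrite(&magic, sizeof magic, 1, f) == 1 &&
             fwrite(m, sizeof *m, 1, f) == 1;
    if (fclose(f) != 0) ok = 0;  /* a failed close can mean lost data */
    return ok ? 0 : -1;
}

/* Read the tag, verify it, then read the struct. Returns 0 on success. */
int load_model(Model *m, const char *path) {
    assert(m != NULL && path != NULL);
    FILE *f = fopen(path, "rb");
    if (!f) return -1;
    unsigned magic = 0;
    int ok = fread(&magic, sizeof magic, 1, f) == 1 &&
             magic == MODEL_MAGIC &&
             fread(m, sizeof *m, 1, f) == 1;
    fclose(f);
    return ok ? 0 : -1;
}
```

The tag check lets `load_model` reject files that were not produced by `save_model`. A text-based format would be easier to inspect and more portable, at the cost of more parsing code.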

Getting Started

Prerequisites

GCC compiler

Compilation

To compile the project, you can use the following command in the project directory:

./run_project.sh

Usage

Training and Testing

To train the model and test it on a dataset:

./test_output

This will load the email data, train the model, test it, and save the model to a file.

Predicting

To use the trained model for predicting on new emails:

./test_output --predict

This will load the saved model and allow you to input emails for classification.

Performance

The classifier's performance can be evaluated based on:

Accuracy: Printed at the end of the test phase, showing the percentage of correctly classified emails.
Execution Time: Displayed in milliseconds, showing the total time taken for training and testing.
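
Both figures can be computed with a few lines of standard C. The sketch below is illustrative only: `accuracy_percent` and `elapsed_ms` are hypothetical helper names, labels are assumed to be stored as 1 (spam) and 0 (not spam), and timing is assumed to use `clock()` from `<time.h>`, which measures processor time rather than wall-clock time. The project itself may measure time differently.

```c
#include <assert.h>
#include <stddef.h>
#include <time.h>

/* Percentage of positions where the predicted label equals the true label. */
double accuracy_percent(const int *predicted, const int *actual, size_t n) {
    assert(predicted != NULL && actual != NULL && n > 0);
    size_t correct = 0;
    for (size_t i = 0; i < n; i++)
        if (predicted[i] == actual[i]) correct++;
    return 100.0 * (double)correct / (double)n;
}

/* Convert two clock() readings into elapsed processor time in milliseconds. */
double elapsed_ms(clock_t start, clock_t end) {
    return 1000.0 * (double)(end - start) / (double)CLOCKS_PER_SEC;
}
```

A typical use is to call `clock()` before training starts and again after testing finishes, then pass both readings to `elapsed_ms`.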

Figure: Time taken for training and testing vs. dataset size

Figure: Accuracy vs. dataset size

Note: Performance may vary depending on the size and quality of the training dataset, as well as the characteristics of the emails being classified. The current dataset was created by me and is not a real-world dataset.

Future Scope and Improvements

Implement more advanced text processing techniques like TF-IDF, stemming, and stop-word removal.
Experiment with different probability estimation methods and feature selection techniques like chi-square.
Add more evaluation metrics like precision, recall, and F1 score.
Generalize the classifier to handle multiple classes and improve the model persistence functionality.
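
As a starting point for the extra evaluation metrics, here is a minimal sketch of precision, recall, and F1 score for the spam class. It is not part of the current project: the `Metrics` struct and `compute_metrics` function are hypothetical, labels are assumed to be 1 (spam) and 0 (not spam), and a metric with a zero denominator is reported as 0 by convention.

```c
#include <assert.h>
#include <stddef.h>

typedef struct {
    double precision;  /* of emails flagged as spam, the share that were spam */
    double recall;     /* of actual spam emails, the share that were flagged */
    double f1;         /* harmonic mean of precision and recall */
} Metrics;

Metrics compute_metrics(const int *predicted, const int *actual, size_t n) {
    assert(predicted != NULL && actual != NULL);
    size_t tp = 0, fp = 0, fn = 0;
    for (size_t i = 0; i < n; i++) {
        if (predicted[i] == 1 && actual[i] == 1) tp++;       /* true positive */
        else if (predicted[i] == 1 && actual[i] == 0) fp++;  /* false positive */
        else if (predicted[i] == 0 && actual[i] == 1) fn++;  /* false negative */
    }
    Metrics m = {0.0, 0.0, 0.0};
    if (tp + fp > 0) m.precision = (double)tp / (double)(tp + fp);
    if (tp + fn > 0) m.recall = (double)tp / (double)(tp + fn);
    if (m.precision + m.recall > 0.0)
        m.f1 = 2.0 * m.precision * m.recall / (m.precision + m.recall);
    return m;
}
```

Unlike accuracy, these metrics stay informative when the dataset contains far more non-spam than spam emails.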


harshpreet931/Spam-Email-Classification
