This repository contains code for a machine learning project focused on detecting fake news. The project utilizes Python and various libraries such as Pandas, NumPy, Matplotlib, NLTK, and Scikit-learn.
Kaggle: https://www.kaggle.com/code/alihassanml/fake-news-detection
Fake news has become a significant issue in today's information age, where misinformation can spread rapidly through various media channels. This project aims to build a machine learning model that can automatically detect fake news articles based on their content.
## Libraries Used
- pandas: Used for data manipulation and analysis.
- numpy: Provides support for mathematical functions and operations on arrays.
- matplotlib: A plotting library used for data visualization.
- re: Offers support for regular expressions, helpful for text preprocessing.
- seaborn: Works alongside matplotlib for enhanced data visualization.
- nltk: Natural Language Toolkit, used for text processing tasks such as stopword removal and stemming.
- scikit-learn (sklearn):
  - TfidfVectorizer and TfidfTransformer: Used for feature extraction from text data using TF-IDF (Term Frequency-Inverse Document Frequency); see the sketch after this list.
  - train_test_split: For splitting the dataset into training and testing sets.
  - LogisticRegression: Implements logistic regression, a commonly used classification algorithm.
  - accuracy_score: Calculates the accuracy of the model.
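For illustration, here is a minimal TF-IDF sketch; the toy sentences are placeholders, not part of the project's dataset:

```python
# Minimal TF-IDF example on two toy documents.
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "breaking news the economy is collapsing",
    "scientists publish peer reviewed study on climate",
]

vectorizer = TfidfVectorizer(stop_words="english")  # drop common English stopwords
X = vectorizer.fit_transform(docs)                  # sparse matrix of shape (n_docs, n_terms)

print(X.shape)
print(vectorizer.get_feature_names_out())
```

Each document becomes a sparse vector whose entries weight a term by how often it appears in that document and how rare it is across the corpus.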
The dataset used in this project contains a collection of news articles labeled as either fake or real. It includes various features such as the title, text, and other metadata.
- Data Preprocessing: Text data is cleaned and preprocessed using techniques such as stopword removal, stemming, and vectorization (see the preprocessing sketch after this list).
- Feature Engineering: Text features are extracted using TF-IDF (Term Frequency-Inverse Document Frequency) vectorization.
- Model Training: The preprocessed data is split into training and testing sets. A logistic regression model is trained on the training data.
- Model Evaluation: The trained model is evaluated on the test set using accuracy as the performance metric.
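A rough sketch of the preprocessing step with NLTK is shown below; the helper name `clean_text` and the choice of the Porter stemmer are illustrative assumptions, not necessarily what the notebook uses:

```python
import re

from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer

# Requires the stopword corpus once: nltk.download('stopwords')
stemmer = PorterStemmer()
stop_words = set(stopwords.words("english"))

def clean_text(text: str) -> str:
    """Keep letters only, lowercase, drop stopwords, and stem each remaining word."""
    text = re.sub(r"[^a-zA-Z]", " ", text).lower()
    words = [stemmer.stem(word) for word in text.split() if word not in stop_words]
    return " ".join(words)

print(clean_text("Breaking: Scientists CONFIRMED the shocking truth!"))
# e.g. "break scientist confirm shock truth"
```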
Project Workflow:
- Data Loading: The project starts by loading a dataset containing both real and fake news articles.
- Data Preprocessing:
  - Text Cleaning: Removing unnecessary characters, special symbols, and URLs.
  - Tokenization: Splitting the text into individual words or tokens.
  - Stopword Removal: Eliminating common words that do not carry significant meaning.
  - Stemming: Reducing words to their root form to normalize the text.
- Feature Extraction: Using TF-IDF to convert text data into numerical vectors.
- Model Training: Splitting the data into training and testing sets, then training a logistic regression model on the training data.
- Model Evaluation: Evaluating the trained model's performance using accuracy on the testing set (a condensed sketch of these two steps follows this list).
- Deployment: After successful evaluation, the model can be deployed to predict fake news on new data.
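The following is a condensed sketch of the training and evaluation steps. The CSV path and the `content`/`label` column names are assumptions and should be adjusted to the actual dataset in `data/`:

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Assumed file name and column names; adjust to the dataset shipped in data/.
df = pd.read_csv("data/train.csv")
texts, labels = df["content"], df["label"]

# Convert the (cleaned) article text into TF-IDF feature vectors.
vectorizer = TfidfVectorizer(stop_words="english")
X = vectorizer.fit_transform(texts)

# Hold out 20% of the articles for evaluation.
X_train, X_test, y_train, y_test = train_test_split(
    X, labels, test_size=0.2, stratify=labels, random_state=42
)

model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)

print("Accuracy:", accuracy_score(y_test, model.predict(X_test)))

# For the deployment step, the fitted vectorizer and model could be persisted
# (e.g. with joblib.dump) and reloaded to score new articles.
```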
Project structure:
- `data/`: Contains the dataset used in the project.
- `notebooks/`: Jupyter notebooks containing code for data preprocessing, model training, and evaluation.
- `scripts/`: Python scripts for various functions and utilities used in the project.
- `README.md`: This file, providing an overview of the project.
To run the project:
- Clone this repository to your local machine: `git clone https://github.com/alihassanml/fake-news-detection.git`
- Navigate to the project directory: `cd fake-news-detection`
- Install the required dependencies: `pip install -r requirements.txt`
- Open and run the Jupyter notebooks in the `notebooks/` directory to explore the project.
Feel free to contribute to this project by opening issues or pull requests.
Author: [Ali Hassan](https://github.com/alihassanml)