mrsage-101/URDU-SENTIMENT-ANALYSIS-TOOL

URDU-SENTIMENT-ANALYSIS-TOOL

A Natural Language Processing (NLP) pipeline for sentiment analysis of Urdu social media posts. It combines Urdu-specific text preprocessing, TF-IDF and Word2Vec features, and a logistic regression classifier to label the sentiment of Urdu text.

Project Overview

This project implements a sentiment analysis tool tailored to Urdu text, with a focus on social media content. Because Urdu poses its own language-processing challenges, the project includes custom tools for preprocessing, feature extraction, and model training. It is built on NLTK, Gensim, Scikit-Learn, and Pandas.

Features

  • Text Preprocessing: Includes tokenization, stopword removal, stemming, and lemmatization customized for Urdu.
  • Feature Extraction: TF-IDF and Word2Vec models for feature representation.
  • N-Gram Analysis: Captures context through n-grams.
  • Sentiment Classification: Logistic Regression classifier for sentiment prediction.
  • Performance Metrics: Evaluation using accuracy, precision, recall, and F1-score.
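As an illustration of the n-gram feature listed above, here is a minimal sketch of how contiguous n-grams can be generated from a token list. The `ngrams` helper and the sample sentence are assumptions made for this example and are not taken from the project's code.

```python
def ngrams(tokens, n):
    """Return every run of n consecutive tokens as a tuple."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


# Illustrative sentence ("this film is not good"), split on whitespace.
tokens = "یہ فلم اچھی نہیں ہے".split()

unigrams = ngrams(tokens, 1)
bigrams = ngrams(tokens, 2)

for gram in bigrams:
    print(" ".join(gram))
```

In this sentence the bigrams keep the negation word next to the adjective it modifies, which is context that single-word features lose.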

Installation

Clone the repository, then install the libraries listed under Technologies Used:

git clone https://github.com/mrsage-101/URDU-SENTIMENT-ANALYSIS-TOOL.git
cd URDU-SENTIMENT-ANALYSIS-TOOL
pip install nltk gensim scikit-learn pandas

Usage

To use the sentiment analysis tool, follow these steps:

  1. Data Preparation: Load Urdu social media text data in a structured format.
  2. Run Preprocessing: Use the provided scripts to clean and preprocess the text.
  3. Train Model: Run the training script to build the logistic regression classifier.
  4. Evaluate Model: Evaluate model performance using the metrics provided.

Example command:

python sentiment_analysis.py --input your_data_file.csv

Preprocessing Pipeline

The Urdu text preprocessing pipeline includes:

  • Tokenization: Custom Urdu tokenization.
  • Stopword Removal: Removes common Urdu stopwords.
  • Stemming & Lemmatization: Reduces words to their root forms for better analysis.
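The first two steps above can be sketched with the standard library alone. This is a simplified illustration, not the project's own tokenizer: the stopword list is a tiny hypothetical sample, the `preprocess` function name is an assumption, and stemming and lemmatization are left out because they need Urdu-specific rules or a lexicon.

```python
import re
import unicodedata

# Hypothetical, deliberately tiny stopword list; a real list would be much longer.
URDU_STOPWORDS = {"ہے", "ہیں", "کا", "کی", "کے", "میں", "اور", "یہ", "سے", "کو"}

# Arabic-script diacritics: U+064B to U+0652, plus superscript alef U+0670.
DIACRITICS = re.compile("[\u064B-\u0652\u0670]")

# Urdu full stop, comma, question mark, semicolon, and common Latin punctuation.
PUNCTUATION = re.compile("[\u06D4\u060C\u061F\u061B!?.,:;\"'()]")


def preprocess(text):
    """Normalize, strip diacritics and punctuation, tokenize, drop stopwords."""
    text = unicodedata.normalize("NFC", text)  # unify composed/decomposed forms
    text = DIACRITICS.sub("", text)            # optional vowel marks add sparsity
    text = PUNCTUATION.sub(" ", text)          # punctuation becomes a separator
    tokens = text.split()                      # whitespace tokenization
    return [t for t in tokens if t not in URDU_STOPWORDS]


print(preprocess("یہ فلم بہت اچھی ہے۔"))
```

The sample stopword list leaves out negation words such as "نہیں" on purpose, since removing them can flip the sentiment of a sentence.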

Modeling

The classifier is a logistic regression model trained on TF-IDF and Word2Vec representations of the text. N-gram features are included to capture dependencies between adjacent words that single-word features miss.
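A minimal sketch of the TF-IDF part of this setup with scikit-learn follows. It assumes binary labels named "positive" and "negative", whitespace tokenization, and unigram plus bigram features. The four training sentences are toy data invented for this example, and none of the parameter values are taken from the project's training script. The Word2Vec representation is not covered here.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy data invented for illustration only; real training needs a labeled corpus.
train_texts = [
    "یہ فلم بہت اچھی ہے",
    "مجھے یہ گانا بہت پسند ہے",
    "یہ فلم بہت بری ہے",
    "مجھے یہ گانا بالکل پسند نہیں",
]
train_labels = ["positive", "positive", "negative", "negative"]

model = make_pipeline(
    # Split on whitespace rather than the default regex, and build
    # unigram and bigram TF-IDF features.
    TfidfVectorizer(
        tokenizer=str.split,
        token_pattern=None,
        lowercase=False,
        ngram_range=(1, 2),
    ),
    LogisticRegression(max_iter=1000),
)
model.fit(train_texts, train_labels)

new_posts = ["یہ گانا بہت اچھا ہے"]
predictions = model.predict(new_posts)
probabilities = model.predict_proba(new_posts)
print(predictions, probabilities)
```

Wrapping the vectorizer and classifier in one pipeline means the same tokenization and vocabulary are applied at training and prediction time.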

Evaluation

Model performance is evaluated on the following metrics:

  • Accuracy
  • Precision
  • Recall
  • F1-Score
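To make the four metrics concrete, here is a small sketch that computes them by hand for a binary task. The label names, the `binary_metrics` helper, and the example predictions are assumptions made for illustration and are not results from this project. Scikit-Learn provides equivalent functions in `sklearn.metrics`.

```python
def binary_metrics(y_true, y_pred, positive="positive"):
    """Compute accuracy, precision, recall and F1 for one positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    correct = sum(t == p for t, p in pairs)

    accuracy = correct / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    # F1 is the harmonic mean of precision and recall.
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Made-up labels and predictions, for illustration only.
y_true = ["positive", "positive", "negative", "negative", "positive"]
y_pred = ["positive", "negative", "negative", "positive", "positive"]

scores = binary_metrics(y_true, y_pred)
print(scores)
```

Accuracy alone can be misleading when one sentiment class dominates the data, which is why precision, recall, and F1 are reported alongside it.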

Technologies Used

  • Python
  • NLTK
  • Gensim
  • Scikit-Learn
  • Pandas
