A Natural Language Processing (NLP) pipeline for sentiment analysis of Urdu social media posts. It combines Urdu-specific preprocessing with TF-IDF and Word2Vec features and a logistic regression classifier to label the sentiment of Urdu text.
This project implements a sentiment analysis tool tailored to Urdu text, with a focus on social media content. Because general-purpose NLP tooling handles Urdu poorly, the project includes custom tools for preprocessing, feature extraction, and model training. It is built on NLTK, Gensim, Scikit-Learn, and Pandas and covers the full workflow from raw text to evaluated model.
- Text Preprocessing: Includes tokenization, stopword removal, stemming, and lemmatization customized for Urdu.
- Feature Extraction: TF-IDF and Word2Vec models for feature representation.
- N-Gram Analysis: Captures context through n-grams.
- Sentiment Classification: Logistic Regression classifier for sentiment prediction.
- Performance Metrics: Evaluation using accuracy, precision, recall, and F1-score.
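The n-gram idea from the feature list above can be illustrated with a few lines of plain Python. This is a minimal sketch, not the project's own code: the helper name `ngrams` and the sample sentence are assumptions made for illustration.

```python
from typing import List, Tuple


def ngrams(tokens: List[str], n: int) -> List[Tuple[str, ...]]:
    """Return every contiguous window of n tokens (hypothetical helper)."""
    if n < 1:
        raise ValueError("n must be at least 1")
    # A list shorter than n yields no windows, so the result is empty.
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


# Illustrative sentence: "This film is very good"
tokens = ["یہ", "فلم", "بہت", "اچھی", "ہے"]
bigrams = ngrams(tokens, 2)
trigrams = ngrams(tokens, 3)
```

A five-token sentence produces four bigrams and three trigrams; each one keeps a little of the word order that single-token features discard.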
Clone the repository and change into the project directory (the required dependencies, NLTK, Gensim, Scikit-Learn, and Pandas, must also be installed):
git clone https://github.com/your-username/urdu-sentiment-analysis.git
cd urdu-sentiment-analysis
To use the sentiment analysis tool, follow these steps:
- Data Preparation: Load Urdu social media text data in a structured format such as CSV.
- Run Preprocessing: Use the provided scripts to clean and preprocess the text.
- Train Model: Run the training script to build the logistic regression classifier.
- Evaluate Model: Evaluate model performance using the metrics provided.
Example command:
python sentiment_analysis.py --input your_data_file.csv
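The layout of the input file is not specified here. As a sketch of the data preparation step, assuming a UTF-8 CSV with hypothetical `text` and `label` columns, loading and basic cleaning with Pandas might look like this (`load_posts` is an illustrative name, not a function from the repository):

```python
import io

import pandas as pd


def load_posts(source) -> pd.DataFrame:
    """Load posts from a CSV path or file-like object (hypothetical helper).

    Assumes two columns named 'text' and 'label'; both names are
    assumptions for this sketch.
    """
    df = pd.read_csv(source, encoding="utf-8")
    missing = {"text", "label"} - set(df.columns)
    if missing:
        raise ValueError(f"missing columns: {sorted(missing)}")
    # Drop rows with no text or no label, then trim surrounding whitespace.
    df = df.dropna(subset=["text", "label"]).assign(
        text=lambda d: d["text"].astype(str).str.strip()
    )
    return df[df["text"] != ""].reset_index(drop=True)


# In-memory stand-in for a real file; the rows are invented for illustration.
demo_csv = io.StringIO(
    "text,label\n"
    "یہ فلم بہت اچھی ہے,positive\n"
    ",negative\n"
    "سروس بہت خراب تھی,negative\n"
)
posts = load_posts(demo_csv)
```

The row with an empty text field is dropped, leaving two usable posts.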
The Urdu text preprocessing pipeline includes:
- Tokenization: Custom Urdu tokenization.
- Stopword Removal: Removes common Urdu stopwords.
- Stemming & Lemmatization: Reduces words to their root forms for better analysis.
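The first two steps above can be sketched in plain Python. This is an illustration under stated assumptions rather than the project's actual tokenizer: the stopword set is a tiny sample, the punctuation list is partial, and the function names are hypothetical. Stemming and lemmatization are left out because their Urdu-specific rules are not described here.

```python
import re
from typing import List

# Illustrative subset only; a real Urdu stopword list is much longer.
URDU_STOPWORDS = {"ہے", "کی", "کا", "کے", "میں", "اور", "یہ", "کو", "سے", "پر"}

# Urdu and Latin punctuation treated as token separators (partial list).
PUNCTUATION = "۔،؟؛!?.,:;\"'()[]{}"
_PUNCT_RE = re.compile("[" + re.escape(PUNCTUATION) + "]")
# Social media noise: links and @mentions.
_URL_OR_MENTION_RE = re.compile(r"https?://\S+|@\w+")


def tokenize(text: str) -> List[str]:
    """Strip links, mentions, and punctuation, then split on whitespace."""
    text = _URL_OR_MENTION_RE.sub(" ", text)
    text = _PUNCT_RE.sub(" ", text)
    return text.split()


def remove_stopwords(tokens: List[str]) -> List[str]:
    """Drop tokens that appear in the stopword set."""
    return [t for t in tokens if t not in URDU_STOPWORDS]


def preprocess(text: str) -> List[str]:
    return remove_stopwords(tokenize(text))


sample = "یہ فلم بہت اچھی ہے۔ https://example.com @user"
cleaned = preprocess(sample)
```

Links and mentions are removed before punctuation so that the dots and slashes inside a URL do not split it into stray tokens.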
The tool trains a logistic regression classifier on TF-IDF and Word2Vec representations of the preprocessed text. N-gram features capture local word dependencies that single-word features miss, which can improve model accuracy.
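As a rough sketch of the TF-IDF half of this setup, the following Scikit-Learn pipeline combines unigram and bigram TF-IDF features with logistic regression. The four training sentences and their labels are invented for illustration, the hyperparameters are assumptions, and the Word2Vec representation is not covered.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy data invented for this sketch; a real model needs far more examples.
texts = [
    "یہ فلم بہت اچھی ہے",
    "کھانا بہت مزیدار تھا",
    "سروس بہت خراب تھی",
    "یہ فلم بالکل بیکار ہے",
]
labels = ["positive", "positive", "negative", "negative"]

model = make_pipeline(
    TfidfVectorizer(
        tokenizer=str.split,   # whitespace tokens; swap in a custom Urdu tokenizer
        token_pattern=None,    # unused when a tokenizer is supplied
        lowercase=False,       # Urdu script has no letter case
        ngram_range=(1, 2),    # unigrams and bigrams
    ),
    LogisticRegression(max_iter=1000),
)
model.fit(texts, labels)

prediction = model.predict(["یہ فلم اچھی ہے"])
vocabulary = model.named_steps["tfidfvectorizer"].vocabulary_
```

Wrapping both steps in one pipeline ensures that the vocabulary and IDF weights are learned from the training data alone and then reused unchanged at prediction time.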
Model performance is evaluated on the following metrics:
- Accuracy
- Precision
- Recall
- F1-Score
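For reference, the four metrics can be computed directly from predicted and true labels. The sketch below handles the binary case in plain Python; the function name and the example labels are assumptions for illustration, and Scikit-Learn's `sklearn.metrics` module provides equivalent functions.

```python
from typing import Dict, Sequence


def binary_metrics(y_true: Sequence[int], y_pred: Sequence[int]) -> Dict[str, float]:
    """Accuracy, precision, recall, and F1 for labels 1 (positive) and 0."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)

    # Fall back to 0.0 when a denominator is zero.
    accuracy = (tp + tn) / len(pairs) if pairs else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Invented labels: 3 true positives, 2 false positives, 1 false negative.
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 1, 1, 1, 0]
scores = binary_metrics(y_true, y_pred)
```

With these labels, accuracy is 5/8, precision is 3/5, recall is 3/4, and F1 is 2/3, which shows how the four numbers can diverge on the same predictions.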
- Python
- NLTK
- Gensim
- Scikit-Learn
- Pandas