FlairifyMe is a Reddit flair detector for the r/india subreddit. It takes a post's URL as user input and predicts the post's flair using a Logistic Regression model. The web application is hosted on Heroku at FlairifyMe (https://flairify-me.herokuapp.com/).
The web-application also offers visual content and temporal analysis of the collected data.
The project has been developed using Python and several of its libraries and frameworks:
- Scikit-learn
- PRAW
- NLTK
- Flask
- numpy
- pandas
- PyMongo
The scraped data is stored in, and loaded from, a MongoDB instance. The web application is built on Flask and deployed on Heroku.
The repository contains the following files and folders:
- Data: Contains CSV files with the preprocessed scraped data, the MongoDB collections, and the scripts for scraping, preprocessing and analysing the data.
- Models: Contains the machine learning model used for predicting flairs.
- Training: Contains the script for text-classification.
- templates: Contains the HTML templates for the web application.
- app.py: Used to start up the Flask server.
- flair_predictor.py: Module to accept a valid URL and predict the post's flair by loading the model.
- nltk.txt: Contains NLTK library dependencies for deployment on Heroku.
- requirements.txt: Contains all dependencies for the project.
The web application lets the user enter the URL of an r/india post and displays the predicted flair for that post. The user can view the content and temporal analysis of the scraped data by clicking the 'Post Analysis' button in the top-right corner of the page.
To run on a local server:
- Clone the repository
git clone https://github.com/BhavyaC16/FlairifyMe.git
- Create a virtual environment
python3 -m venv FlairifyMe
source FlairifyMe/bin/activate
cd FlairifyMe/
- Install the project dependencies
pip3 install -r requirements.txt
- Create the file RedditAPI.py as follows:
def accinfo():
    # Credentials of your Reddit app and account
    personalScript = '<enter_Reddit_App_personal_script_here>'
    secretKey = '<enter_Reddit_App_secret_key_here>'
    app = 'FlairifyMe'
    username = '<enter_your_Reddit_Username_here>'
    password = '<enter_your_Reddit_password>'
    return [personalScript, secretKey, app, username, password]
Copy the same file to the directory ./Data/Scripts/ as well if you want to scrape posts from Reddit.
- To run the server, execute the following command
python3 app.py
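For reference, here is a minimal sketch of how the list returned by accinfo() could be passed to PRAW. It assumes that personalScript is the app's client ID and that the app name serves as the user agent; the helper names praw_kwargs and make_reddit are hypothetical and not part of the repository.

```python
def praw_kwargs(account):
    """Map the list returned by accinfo() onto praw.Reddit keyword arguments.

    Assumption: the list order is
    [personalScript, secretKey, app, username, password].
    """
    personal_script, secret_key, app, username, password = account
    return {
        "client_id": personal_script,   # assumed: personal-use script ID
        "client_secret": secret_key,
        "user_agent": app,              # assumed: app name doubles as user agent
        "username": username,
        "password": password,
    }


def make_reddit(account):
    """Create an authenticated PRAW client (requires `pip install praw`)."""
    import praw  # imported lazily so praw_kwargs can be used without PRAW

    return praw.Reddit(**praw_kwargs(account))
```

With RedditAPI.py in place, `make_reddit(RedditAPI.accinfo())` would return a client object.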
The Python library PRAW was used to scrape data from the subreddit r/india, for a total of 3,156 posts across 13 different flairs. The number of posts scraped per flair is as follows:
The data has been preprocessed using the NLTK library. The following procedures have been executed on the title, body and comments to clean the data:
- Tokenizing and removing symbols
- Removing stopwords
- Stemming
Two separate datasets have been prepared and saved in MongoDB for training: one with stemming and one without, since stemming is reported to reduce prediction accuracy in some cases.
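The cleaning steps can be sketched roughly as follows. This is an illustration, not the project's script: the function name clean_text is hypothetical, a regular expression stands in for NLTK's tokenizer, and the stopword list and stemmer are passed in so that the same function covers both the stemmed and the non-stemmed dataset.

```python
import re


def clean_text(text, stopwords, stem=None):
    """Tokenize, drop symbols and stopwords, and optionally stem."""
    # Keep runs of letters and digits only; punctuation and symbols are dropped.
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    # Remove stopwords.
    tokens = [t for t in tokens if t not in stopwords]
    # Stem only when a stemming function is supplied.
    if stem is not None:
        tokens = [stem(t) for t in tokens]
    return " ".join(tokens)
```

With NLTK installed, one could pass `set(nltk.corpus.stopwords.words("english"))` as the stopword list and `nltk.stem.PorterStemmer().stem` as the stemmer; which stemmer the project actually uses is not stated here.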
The data has been loaded from MongoDB into a pandas DataFrame and split 80-20 into training and testing sets using scikit-learn. For each of the feature sets (Title, Body, Comments, Title+Comments and Title+Body+Comments), a model was trained with each of three algorithms: Naive Bayes, Linear SVM and Logistic Regression. This was done for both datasets (with and without stemming).
The resulting prediction accuracies are summarized in the following tables:
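A rough sketch of such a comparison with scikit-learn is given below. The function names, the TF-IDF vectorizer, the random seed and the max_iter setting are assumptions made for illustration; the actual training script may differ.

```python
def combine_features(post, fields):
    """Join the selected text fields of one post into a single string."""
    parts = [post.get(f) or "" for f in fields]
    return " ".join(p for p in parts if p)


def compare_algorithms(texts, labels, seed=42):
    """Return {algorithm name: test accuracy} for an 80-20 split."""
    # scikit-learn is imported lazily (`pip install scikit-learn`).
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import accuracy_score
    from sklearn.model_selection import train_test_split
    from sklearn.naive_bayes import MultinomialNB
    from sklearn.pipeline import make_pipeline
    from sklearn.svm import LinearSVC

    x_train, x_test, y_train, y_test = train_test_split(
        texts, labels, test_size=0.2, random_state=seed
    )
    classifiers = {
        "Naive Bayes": MultinomialNB(),
        "Linear SVM": LinearSVC(),
        "Logistic Regression": LogisticRegression(max_iter=1000),
    }
    scores = {}
    for name, clf in classifiers.items():
        # Vectorizer choice is an assumption; the source does not name one.
        model = make_pipeline(TfidfVectorizer(), clf)
        model.fit(x_train, y_train)
        scores[name] = accuracy_score(y_test, model.predict(x_test))
    return scores
```

For the Title+Comments feature set, for example, the texts would be built with `combine_features(post, ["title", "comments"])` for every post before calling `compare_algorithms`.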
DATA WITHOUT STEMMING:
Feature\Algorithm | Naive Bayes | Linear SVM | Logistic Regression |
---|---|---|---|
Title | 0.59177 | 0.58386 | 0.54430 |
Body | 0.20569 | 0.24367 | 0.24051 |
Comments | 0.31171 | 0.59494 | 0.58069 |
Title+Comments | 0.37500 | 0.64082 | 0.63449 |
Title+Body+Comments | 0.37816 | 0.64399 | 0.65189 |
DATA WITH STEMMING:
Feature\Algorithm | Naive Bayes | Linear SVM | Logistic Regression |
---|---|---|---|
Title | 0.57753 | 0.57120 | 0.54430 |
Body | 0.18354 | 0.23101 | 0.24051 |
Comments | 0.30063 | 0.55538 | 0.56013 |
Title+Comments | 0.36076 | 0.58703 | 0.60126 |
Title+Body+Comments | 0.36551 | 0.59335 | 0.61392 |
After going through the flair-wise and overall prediction accuracies, the model trained with Logistic Regression on Title+Body+Comments of the non-stemmed data was chosen.
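As a quick check on the overall accuracies, the snippet below takes the values from the two tables and looks up the best combination. It considers overall accuracy only, since the flair-wise figures are not listed here; the function name best_configuration is hypothetical.

```python
ALGORITHMS = ("Naive Bayes", "Linear SVM", "Logistic Regression")

# Accuracies copied from the tables, one tuple per feature set.
TABLES = {
    "without stemming": {
        "Title": (0.59177, 0.58386, 0.54430),
        "Body": (0.20569, 0.24367, 0.24051),
        "Comments": (0.31171, 0.59494, 0.58069),
        "Title+Comments": (0.37500, 0.64082, 0.63449),
        "Title+Body+Comments": (0.37816, 0.64399, 0.65189),
    },
    "with stemming": {
        "Title": (0.57753, 0.57120, 0.54430),
        "Body": (0.18354, 0.23101, 0.24051),
        "Comments": (0.30063, 0.55538, 0.56013),
        "Title+Comments": (0.36076, 0.58703, 0.60126),
        "Title+Body+Comments": (0.36551, 0.59335, 0.61392),
    },
}


def best_configuration(tables):
    """Return (dataset, feature set, algorithm, accuracy) with the top accuracy."""
    acc, dataset, feature, algo = max(
        (acc, dataset, feature, algo)
        for dataset, table in tables.items()
        for feature, row in table.items()
        for algo, acc in zip(ALGORITHMS, row)
    )
    return dataset, feature, algo, acc


print(best_configuration(TABLES))
# ('without stemming', 'Title+Body+Comments', 'Logistic Regression', 0.65189)
```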
The saved model is loaded for predicting the flair once the post features (title, body and comments) have been cleaned using NLTK. The returned result is displayed on the web-application.
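The prediction step might look roughly like the sketch below. It assumes a scikit-learn style model whose predict method accepts a list of strings, and a cleaning function such as the NLTK-based one described earlier; the name predict_flair and its signature are hypothetical, and how the model file is loaded is left out because it is not specified here.

```python
def predict_flair(model, clean, title, body, comments):
    """Clean the post's text fields, join them, and ask the model for a flair.

    model    -- object with a scikit-learn style predict(list_of_texts) method
    clean    -- function that cleans one string
    comments -- list of comment strings
    """
    parts = [title, body, " ".join(comments)]
    # Skip empty fields (link posts, for example, have no body).
    text = " ".join(clean(p) for p in parts if p)
    # predict takes a sequence of samples and returns one label per sample.
    return model.predict([text])[0]
```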
A developer API has been implemented using Flask. It returns JSON containing the predicted flair of the Reddit post queried by the user.
It can be accessed by querying:
flairify-me.herokuapp.com/api/resource?redditURL=<enter_url_here>
On success, it returns JSON of the following format:
{'status': 'successful', 'status_code': 200, 'result': {'flair': '<predicted_flair>'}}
Otherwise, it returns JSON of the format:
{'status': 'failed', 'status_code': <error_code>, 'result': {'error': '<error_message>'}}
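A small client for this API could be sketched as follows, using only the standard library. The https scheme, the helper names and the timeout are assumptions; the sketch also assumes the server sends standard JSON that json can parse.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Assumed scheme; the host and path are the ones given above.
API_ENDPOINT = "https://flairify-me.herokuapp.com/api/resource"


def build_query(reddit_url):
    """Build the request URL, percent-encoding the post URL."""
    return API_ENDPOINT + "?" + urlencode({"redditURL": reddit_url})


def parse_response(payload):
    """Return the predicted flair, or raise with the reported error."""
    if payload.get("status") == "successful":
        return payload["result"]["flair"]
    code = payload.get("status_code")
    message = payload.get("result", {}).get("error")
    raise RuntimeError(f"API request failed ({code}): {message}")


def fetch_flair(reddit_url, timeout=10):
    """Query the API over the network and return the predicted flair."""
    with urlopen(build_query(reddit_url), timeout=timeout) as response:
        return parse_response(json.load(response))
```

Note that urlopen raises urllib.error.HTTPError itself if the server answers with an HTTP error status, so callers may want to catch both exceptions.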
I plan on adding the following features to the project:
- Improving the prediction by training the model on user inputs.
- Automating the scripts so that users can build a prediction model for any subreddit they enter.
This task has been a great learning experience for me, as it was my first time working with Machine Learning and Natural Language Processing, with tools like Heroku and MongoDB, and with libraries like scikit-learn, NLTK, PRAW and Flask.