This project was developed for a Data Mining course. It generates Statements of Purpose (SOPs) for universities from a dataset of universities. Its primary objectives are data cleaning, keyword extraction, and SOP generation. The original plan of crawling websites for SOPs was abandoned because of constraints; instead, SOPs are generated with OpenAI's GPT-3, accessed through an OpenAI API key.
The University SOP Generator project encompasses the following key components:
- Data Cleaning: The initial dataset of universities requires cleaning to ensure data quality and consistency.
- Keyword Extraction: Extract relevant keywords from the dataset to be used in the SOPs.
- SOP Generation: Utilize OpenAI's GPT-3 and an OpenAI API key to generate Statements of Purpose for universities based on the extracted keywords.
- Website Crawling (Unused): Although the main.py file contains code for website crawling using Selenium, this feature was not implemented in the main notebook code due to limitations and cost factors.
- Word Cloud Visualization: Generate word clouds to visualize the most frequent keywords extracted from the dataset.
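The cleaning and keyword extraction components can be illustrated with a small, standard-library-only sketch. It is not the notebook's actual code: the function names `clean_text` and `extract_keywords`, the tiny stop-word set, and the sample descriptions are all assumptions made for illustration. The idea is simply to normalize free text and count the most frequent remaining words.

```python
import re
from collections import Counter

# Hypothetical stop-word list; a real run would likely use a fuller one.
STOP_WORDS = {"a", "an", "and", "the", "of", "in", "for", "to", "is", "with", "on"}


def clean_text(text):
    """Lowercase the text, replace non-letters with spaces, collapse whitespace."""
    text = text.lower()
    text = re.sub(r"[^a-z\s]", " ", text)
    return re.sub(r"\s+", " ", text).strip()


def extract_keywords(descriptions, top_n=5):
    """Return the top_n most frequent words that are not stop words."""
    counts = Counter()
    for description in descriptions:
        for word in clean_text(description).split():
            # Skip stop words and very short tokens.
            if word not in STOP_WORDS and len(word) > 2:
                counts[word] += 1
    return counts.most_common(top_n)


# Made-up sample rows standing in for the universities dataset.
sample = [
    "Research in Machine Learning and Data Mining.",
    "A leading programme for data science & machine learning!",
]
keywords = extract_keywords(sample, top_n=3)
print(keywords)
```

The resulting word counts are also the kind of input a word cloud is typically built from, so the same frequencies could feed the visualization component.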
Before running this project, ensure you have the following:
- Python installed on your system.
- Required Python libraries and dependencies installed (specified in the project's requirements file).
- An OpenAI API key to access the GPT-3 model.
- Clone this repository or download the source files.
- Install the necessary Python packages using pip or your preferred package manager: pip install -r requirements.txt
- Configure your OpenAI API key by following the provided instructions.
- Run the main notebook code to perform data cleaning, keyword extraction, and SOP generation.
- Optionally, run main.py if you wish to utilize Selenium for website crawling (note that this feature is not integrated into the main notebook code).
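For the API key step, one common approach is to keep the key out of the notebook and read it from an environment variable. The sketch below assumes that approach; the helper name `load_api_key` and the variable name `OPENAI_API_KEY` are assumptions and may differ from the instructions provided with this project.

```python
import os


def load_api_key(env=None, var_name="OPENAI_API_KEY"):
    """Return the API key from the environment, or raise if it is missing.

    `env` defaults to os.environ; passing a plain dict makes the helper easy to test.
    """
    env = os.environ if env is None else env
    key = env.get(var_name, "").strip()
    if not key:
        raise RuntimeError(
            f"Set the {var_name} environment variable before running the notebook."
        )
    return key
```

Failing early with a clear message avoids discovering a missing key only after the cleaning and keyword extraction steps have already run.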
- Execute the main notebook to clean the dataset and generate SOPs.
- Review the generated SOPs and adjust them as needed.
- Execute main.py if you decide to use Selenium for website crawling (make sure you've configured the script accordingly).
- Visualize the extracted keywords using word cloud visualization.
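The SOP generation step in the workflow above starts from a prompt assembled from the extracted keywords. The following is a minimal sketch of that assembly only; the function name `build_sop_prompt`, its parameters, and the prompt wording are hypothetical and not taken from the notebook. Sending the prompt to the OpenAI API is left out, since the exact client call depends on the library version installed.

```python
def build_sop_prompt(university, program, keywords, max_words=500):
    """Assemble a text prompt for an SOP from a university name and its keywords."""
    if not keywords:
        raise ValueError("at least one keyword is required")
    keyword_list = ", ".join(keywords)
    return (
        f"Write a Statement of Purpose of at most {max_words} words "
        f"for an applicant to the {program} program at {university}. "
        f"Reflect these themes drawn from the university's profile: {keyword_list}."
    )


# Placeholder values, not rows from the actual dataset.
prompt = build_sop_prompt(
    "Example University",
    "Computer Science",
    ["machine learning", "data mining"],
)
print(prompt)
```

Keeping prompt construction in its own function makes it easy to review and adjust the wording when the generated SOPs need changes.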
To replicate the project, download the universities dataset from the files included in this repository.
Mehrnaz Sadeghieh, Helia Ghahraman
Thank you for exploring our University SOP Generator project. We hope this tool assists you in generating Statements of Purpose for universities efficiently and effectively.