Gain insights into the job market for data engineers in the USA
Live Preview | Data on Kaggle | Request Feature
- Project Overview
- Project Architecture
- Web Scraping
- Data Cleaning, EDA and Model Building
- Installation
- References
- Contact
The goal of this data science project is to gain insights into the job market for data engineers in the USA. By analyzing job postings and related data from Glassdoor, the project aims to identify the most in-demand tools, education degrees, and other qualifications required by companies hiring for this role. Additionally, the project seeks to create a model to predict salaries for data engineers based on a variety of factors including location, company industry and rating, education level, and seniority.
The project begins by scraping, once a week, the data engineering job postings published on Glassdoor US during the previous week. The collected data includes job titles, company names, job locations, job descriptions, salaries, education requirements, and required skills. Each weekly file is named like "glassdoor-data-engineer-15-2023.csv", where 15 is the week number in which the data was scraped and 2023 is the year. The file is first stored locally in the data/raw/ folder, then uploaded to an AWS S3 bucket that holds only the raw, uncleaned data. The data is then cleaned and preprocessed to remove irrelevant information and ensure consistency, and duplicates are dropped. Finally, it is joined with the previously cleaned data in a second S3 bucket, which holds a single CSV file containing all the job postings. All of this is automated in a data pipeline using MageAI.
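The naming convention described above can be sketched as a small helper. This is only an illustration: the function name `raw_filename` is hypothetical, and it assumes that "week number" means the ISO 8601 week and that the number is not zero-padded, neither of which the project confirms.

```python
from datetime import date


def raw_filename(reference_date: date) -> str:
    """Build a weekly file name such as glassdoor-data-engineer-15-2023.csv."""
    # Assumption: the week number is the ISO 8601 week. The ISO year is used
    # with it so that dates close to New Year stay consistent.
    iso_year, iso_week = reference_date.isocalendar()[:2]
    return f"glassdoor-data-engineer-{iso_week}-{iso_year}.csv"


print(raw_filename(date(2023, 4, 12)))  # glassdoor-data-engineer-15-2023.csv
```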
Exploratory data analysis (EDA) is performed on the cleaned data to gain insights into trends and patterns. This includes identifying the most common job titles, the industries and locations with the highest demand, and the most commonly required skills and education degrees. EDA also involves creating visualizations to aid in understanding the data.
After EDA, feature engineering is performed to create new features that may improve the accuracy of the salary prediction model. This includes creating dummy variables for categorical features such as location, education level, and seniority.
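As a rough illustration of the dummy-variable step, the sketch below uses `pandas.get_dummies` on a few made-up rows. The column names and values are hypothetical and do not reflect the project's actual schema.

```python
import pandas as pd

# Hypothetical sample rows; column names are illustrative only.
jobs = pd.DataFrame(
    {
        "location": ["NY", "CA", "NY"],
        "education": ["Bachelor", "Master", "Master"],
        "seniority": ["Senior", "Junior", "Senior"],
        "rating": [4.1, 3.8, 4.5],
    }
)

# One indicator column per category; the numeric column passes through unchanged.
features = pd.get_dummies(jobs, columns=["location", "education", "seniority"])
print(sorted(features.columns))
```

Each categorical column is replaced by one indicator column per distinct value (for example `location_NY` and `location_CA`), which gives the model purely numeric inputs.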
The salary prediction model is built using a random forest regressor. Finally, the model is deployed in a web application using Streamlit, allowing users to input their own data and receive a salary prediction based on the model.
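A minimal sketch of fitting a random forest regressor with scikit-learn follows. The feature layout, salary values, and hyperparameters are all made up for illustration; the project's real features and settings are in its notebooks.

```python
from sklearn.ensemble import RandomForestRegressor

# Hypothetical, already-encoded features: [company_rating, is_senior, has_master]
X = [[4.1, 1, 0], [3.8, 0, 1], [4.5, 1, 1], [3.2, 0, 0], [4.0, 0, 1], [4.7, 1, 0]]
# Hypothetical yearly salaries in thousands of USD (made-up values).
y = [150, 110, 165, 95, 115, 160]

# Assumed hyperparameters; random_state only makes the sketch reproducible.
model = RandomForestRegressor(n_estimators=100, random_state=42)
model.fit(X, y)

# Predict the salary for one new, hypothetical posting.
prediction = model.predict([[4.3, 1, 1]])
print(round(float(prediction[0]), 1))
```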
I adjusted the Selenium-based web scraper to collect data engineering jobs posted within the last week on Glassdoor US. The output file is stored in the "/data/raw" folder under a name like "glassdoor-data-engineer-15-2023.csv", where "15" is the week number in which the jobs were posted and "2023" is the year. See code here.
For each job, I obtained the following: Company Name, Job Title, Salary Estimate, Job Description, Rating, Job Location, Company Size, Company Founded Date, Type of Ownership, Industry, and Sector. The main challenge in this scraping task was duplicated job postings: after the 6th page or so, the Glassdoor website keeps re-rendering the first job listings, so every job scraped from that point on is a duplicate. That is why I implemented a scheduler that runs the script once a week to get the latest job listings. A data pipeline then cleans and transforms the data and joins it with the cleaned dataset stored in an AWS S3 bucket, which contains all the deduplicated, cleaned job listings from previous weeks.
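The weekly merge with duplicate removal could look roughly like the sketch below. It uses only the standard library, and both the function name `merge_without_duplicates` and the choice of key fields are assumptions; the actual pipeline may identify duplicates differently.

```python
from typing import Dict, Iterable, List

Row = Dict[str, str]

# Assumption: a posting is identified by these fields; the project may use another key.
KEY_FIELDS = ("Company Name", "Job Title", "Job Location", "Job Description")


def merge_without_duplicates(existing: Iterable[Row], new_rows: Iterable[Row]) -> List[Row]:
    """Append new_rows to existing, keeping the first occurrence of each posting."""
    seen = set()
    merged = []
    for row in list(existing) + list(new_rows):
        # Normalise case and surrounding whitespace before comparing.
        key = tuple(row.get(field, "").strip().lower() for field in KEY_FIELDS)
        if key not in seen:
            seen.add(key)
            merged.append(row)
    return merged


# Made-up rows: the first new row repeats a posting from a previous week.
previous = [{"Company Name": "Acme", "Job Title": "Data Engineer",
             "Job Location": "Austin, TX", "Job Description": "Build pipelines"}]
this_week = [{"Company Name": "ACME ", "Job Title": "Data Engineer",
              "Job Location": "Austin, TX", "Job Description": "Build pipelines"},
             {"Company Name": "Globex", "Job Title": "Senior Data Engineer",
              "Job Location": "Remote", "Job Description": "Own the warehouse"}]

all_jobs = merge_without_duplicates(previous, this_week)
print(len(all_jobs))  # 2
```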
Please refer to the respective notebooks (data cleaning, data EDA, model building).
- Clone the repository:
git clone https://github.com/Hamagistral/DataEngineers-Glassdoor.git
- Install the required packages:
pip install -r requirements.txt
- Change directory to mage-etl:
cd mage-etl
- Launch the project:
mage start glassdoor_dataengjobs
- Run the pipeline:
mage run glassdoor_dataengjobs glassdoor_dataeng_pipeline
- Change directory to streamlit:
cd streamlit
- Run the app:
streamlit run 01_π΅οΈ_Explore_Data.py
Project inspired by: https://github.com/PlayingNumbers/ds_salary_proj
Scraper Github: https://github.com/arapfaik/scraping-glassdoor-selenium
Scraper Article: https://towardsdatascience.com/selenium-tutorial-scraping-glassdoor-com-in-10-minutes-3d0915c6d905
Mage ETL inspired by: https://youtu.be/WpQECq5Hx9g
Streamlit App inspired by: https://youtu.be/xl0N7tHiwlw