This project analyzes the salaries of data scientists using a dataset from Kaggle. The analysis includes data cleaning, exploratory data analysis (EDA), hypothesis testing, and linear regression modeling to predict salaries based on various factors.
The dataset used in this analysis is from Kaggle: 2023 Data Scientists Salary. It contains information about data scientists' salaries, job titles, experience levels, company sizes, and more.
- `analysis.ipynb`: Jupyter notebook containing the entire analysis, including data cleaning, EDA, hypothesis testing, and linear regression modeling.
- `cleaned_ds_salaries.csv`: Cleaned dataset after data preprocessing.
- `coefficients.csv`: Coefficients of the linear regression model.
- `linear_regression_model.joblib`: Trained linear regression model saved using `joblib`.
- `remote_ratio_job_title.png`: Bar plot showing the distribution of remote work ratios by job title.
- `feature_impact_plot.png`: Bar plot showing the impact of different features on predicted salaries.
- Removing Columns with Dominant Categories: Columns with a single dominant category were removed to ensure quality and reliability.
- Restricting Dataset to USA and USD: The dataset was restricted to entries where the `employee_residence` is the USA and the `salary_currency` is USD.
- Adjusting Salaries for Cumulative Inflation: Salaries from previous years were adjusted to 2023 values using the inflation rate in the US.
- Converting Numerical Column `remote_ratio` to Categorical: The `remote_ratio` column was converted into a categorical column with three categories: `In office`, `Hybrid`, and `Remote`.
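The last three cleaning steps can be sketched roughly as follows. This is a minimal illustration on toy rows, not the notebook's actual code. The column names `work_year` and `salary_in_usd`, the residence code `"US"`, the `remote_ratio` values 0/50/100, the output column `adjusted_salary`, and above all the inflation factors are assumptions made for the example; the factors are placeholders and should be replaced with real cumulative US inflation figures.

```python
import pandas as pd

# Toy rows standing in for the Kaggle data (assumed column names and values).
raw = pd.DataFrame({
    "work_year": [2021, 2022, 2023, 2023],
    "salary_in_usd": [100_000, 120_000, 150_000, 90_000],
    "salary_currency": ["USD", "USD", "USD", "EUR"],
    "employee_residence": ["US", "US", "US", "DE"],
    "remote_ratio": [0, 50, 100, 100],
})

# PLACEHOLDER cumulative inflation factors to 2023; these are not real figures.
INFLATION_TO_2023 = {2021: 1.10, 2022: 1.04, 2023: 1.00}

# Assumed mapping from the numeric ratio to the three categories.
REMOTE_LABELS = {0: "In office", 50: "Hybrid", 100: "Remote"}


def clean(df: pd.DataFrame) -> pd.DataFrame:
    """Filter to US/USD rows, inflation-adjust salaries, and label remote_ratio."""
    mask = (df["employee_residence"] == "US") & (df["salary_currency"] == "USD")
    out = df[mask].copy()
    # Scale each salary by the factor for the year it was reported in.
    out["adjusted_salary"] = out["salary_in_usd"] * out["work_year"].map(INFLATION_TO_2023)
    # Replace the numeric ratio with a categorical label.
    out["remote_ratio"] = out["remote_ratio"].map(REMOTE_LABELS).astype("category")
    return out


cleaned = clean(raw)
print(cleaned[["work_year", "adjusted_salary", "remote_ratio"]])
```

Mapping through a dictionary keeps the year-to-factor assumption in one visible place, so swapping in real inflation data only touches `INFLATION_TO_2023`.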
- Distribution of Numerical Columns: Histograms were plotted for each numerical column.
- Distribution of Categorical Columns: Bar charts were plotted for each categorical column.
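A minimal sketch of these two kinds of plots, assuming Matplotlib and a small toy frame; the column names and the helper `plot_distributions` are hypothetical, and the notebook may well use Seaborn or different styling.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend for scripts; drop this line inside a notebook
import matplotlib.pyplot as plt
import pandas as pd

# Toy data with one numerical and two categorical columns (assumed names).
df = pd.DataFrame({
    "adjusted_salary": [60_000, 75_000, 90_000, 105_000, 130_000, 150_000],
    "experience_level": ["EN", "EN", "MI", "MI", "SE", "SE"],
    "company_size": ["S", "M", "M", "L", "M", "L"],
})


def plot_distributions(frame: pd.DataFrame) -> dict:
    """Return one figure per column: histograms for numbers, bar charts otherwise."""
    figures = {}
    for col in frame.select_dtypes(include="number").columns:
        fig, ax = plt.subplots()
        ax.hist(frame[col], bins=10)
        ax.set_title(f"Distribution of {col}")
        figures[col] = fig
    for col in frame.select_dtypes(exclude="number").columns:
        fig, ax = plt.subplots()
        frame[col].value_counts().plot(kind="bar", ax=ax)
        ax.set_title(f"Counts of {col}")
        figures[col] = fig
    return figures


figures = plot_distributions(df)
```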
- Experience Level vs. Salary: Mood’s Median Test was used to compare salaries across different experience levels.
- Remote Ratio vs. Salary: Mood’s Median Test was used to test the association between remote work proportion and salary.
- Company Size vs. Salary: Kruskal-Wallis Test was used to compare salaries across different company sizes.
- Job Title vs. Remote Ratio: Chi-square Test of Independence was used to determine the association between job title and remote work ratio.
- Job Title vs. Salary: Kruskal-Wallis Test was used to compare salaries across different job titles.
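The five tests listed above could be run with SciPy along these lines. This is a sketch on synthetic data, so the p-values it prints say nothing about the real dataset; the column names, the toy values, and the helpers `salary_groups` and `run_tests` are assumptions for illustration.

```python
import pandas as pd
from scipy.stats import chi2_contingency, kruskal, median_test

# Synthetic data (assumed column names); salaries rise with experience level.
df = pd.DataFrame({
    "experience_level": ["EN"] * 4 + ["MI"] * 4 + ["SE"] * 4,
    "company_size": ["S", "M", "L", "M"] * 3,
    "job_title": ["Data Analyst", "Data Scientist"] * 6,
    "remote_ratio": ["Remote", "Hybrid", "In office", "Remote", "Remote", "In office"] * 2,
    "adjusted_salary": [60_000, 65_000, 70_000, 75_000, 90_000, 95_000,
                        100_000, 105_000, 130_000, 135_000, 140_000, 150_000],
})


def salary_groups(frame, by, value="adjusted_salary"):
    """Return one array of salaries per category of `by`."""
    return [group[value].to_numpy() for _, group in frame.groupby(by)]


def run_tests(frame):
    """Return a dict of p-values, one per hypothesis test."""
    p_values = {}
    # Mood's median test: do the groups share a common median salary?
    for col in ("experience_level", "remote_ratio"):
        _, p, _, _ = median_test(*salary_groups(frame, col))
        p_values[f"mood_{col}"] = p
    # Kruskal-Wallis: do the groups come from the same salary distribution?
    for col in ("company_size", "job_title"):
        p_values[f"kruskal_{col}"] = kruskal(*salary_groups(frame, col)).pvalue
    # Chi-square test of independence on the job title x remote ratio table.
    table = pd.crosstab(frame["job_title"], frame["remote_ratio"])
    _, p, _, _ = chi2_contingency(table.to_numpy())
    p_values["chi2_job_title_remote_ratio"] = p
    return p_values


p_values = run_tests(df)
for name, p in p_values.items():
    print(f"{name}: p = {p:.4f}")
```

All three tests are nonparametric, which suits salary data that is typically skewed. Where an omnibus test rejects, a pairwise follow-up is the usual next step to see which groups differ.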
- Encoding Categorical Variables: Categorical variables were encoded for numerical representation.
- Splitting Dataset: The dataset was split into training and testing subsets.
- Training Model: A linear regression model was trained to predict adjusted salaries.
- Evaluating Model: The model's performance was evaluated using Mean Squared Error (MSE) and R-squared metrics.
- Visualizing Results: The impact of different features on predicted salaries was visualized.
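The modeling steps above might look roughly like the following. The source does not say which encoding was used, so one-hot encoding via `pd.get_dummies` is an assumption, as are the column names, the split ratio, and the random seed. The toy salaries are built to be exactly additive, so the near-perfect fit here reflects the synthetic data only.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic, exactly additive salaries (illustrative values, not real data).
LEVEL_EFFECT = {"EN": 0, "MI": 30_000, "SE": 60_000}
SIZE_EFFECT = {"S": 0, "M": 5_000, "L": 10_000}
rows = [(level, size) for level in LEVEL_EFFECT for size in SIZE_EFFECT] * 2
df = pd.DataFrame(rows, columns=["experience_level", "company_size"])
df["adjusted_salary"] = (
    60_000
    + df["experience_level"].map(LEVEL_EFFECT)
    + df["company_size"].map(SIZE_EFFECT)
)

# Encode categoricals as 0/1 columns; drop_first avoids redundant columns.
X = pd.get_dummies(df[["experience_level", "company_size"]], drop_first=True, dtype=float)
y = df["adjusted_salary"]

# Hold out a quarter of the rows for evaluation (assumed ratio and seed).
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

model = LinearRegression().fit(X_train, y_train)
predictions = model.predict(X_test)

mse = mean_squared_error(y_test, predictions)
r2 = r2_score(y_test, predictions)

# One coefficient per encoded feature, relative to the dropped baseline category.
coefficients = pd.Series(model.coef_, index=X.columns)
print(f"MSE: {mse:.4f}, R-squared: {r2:.4f}")
print(coefficients)
```

With one-hot encoding, each coefficient reads as the expected salary difference from the dropped baseline category, which is what makes a bar plot of coefficients interpretable as feature impact.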
- Python 3.12
- Jupyter Notebook
- Pandas
- Matplotlib
- Seaborn
- SciPy
- Scikit-learn
- Joblib
- Scikit-posthocs
- Clone the repository.
- Install the required packages.
pip install -r requirements.txt
- Open `analysis.ipynb` in Jupyter Notebook.
- Run the notebook cells sequentially to perform the analysis.
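Since the trained model is saved with `joblib`, it can presumably be reloaded without rerunning the notebook. The sketch below shows the save and load round trip on a toy model in a temporary directory; the feature layout the real model expects is not documented here, so the single toy feature is purely illustrative.

```python
import tempfile
from pathlib import Path

import joblib
import numpy as np
from sklearn.linear_model import LinearRegression

# Toy model fitted on y = 2x + 1 (stand-in for the real salary model).
X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = np.array([1.0, 3.0, 5.0, 7.0])
model = LinearRegression().fit(X, y)

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "linear_regression_model.joblib"
    joblib.dump(model, path)      # persist the fitted estimator
    restored = joblib.load(path)  # load it back, e.g. in another session

# The restored model predicts exactly like the original.
prediction = restored.predict(np.array([[4.0]]))[0]
print(prediction)
```

When loading the real file, the input must have the same encoded columns, in the same order, as the data the model was trained on.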
This project analyzes data scientists' salaries through data cleaning, EDA, hypothesis testing, and linear regression modeling. The findings offer insights into how factors such as experience level, company size, job title, and remote work ratio relate to salary.
This project is licensed under the MIT License. See the LICENSE file for more details.