Data Scientist Salaries Analysis

Overview

This project analyzes the salaries of data scientists using a dataset from Kaggle. The analysis includes data cleaning, exploratory data analysis (EDA), hypothesis testing, and linear regression modeling to predict salaries based on various factors.

Dataset

The dataset used in this analysis is from Kaggle: 2023 Data Scientists Salary. It contains information about data scientists' salaries, job titles, experience levels, company sizes, and more.

Project Structure

analysis.ipynb: Jupyter notebook containing the entire analysis, including data cleaning, EDA, hypothesis testing, and linear regression modeling.
cleaned_ds_salaries.csv: Cleaned dataset after data preprocessing.
coefficients.csv: Coefficients of the linear regression model.
linear_regression_model.joblib: Trained linear regression model saved using joblib.
remote_ratio_job_title.png: Bar plot showing the distribution of remote work ratios by job title.
feature_impact_plot.png: Bar plot showing the impact of different features on predicted salaries.

Data Cleaning

Steps

Removing Columns with Dominant Categories: Columns in which a single category accounts for nearly all rows were removed, because they carry little information for the analysis.
Restricting Dataset to USA and USD: The dataset was restricted to entries where the employee_residence is the USA and the salary_currency is USD.
Adjusting Salaries for Cumulative Inflation: Salaries from years before 2023 were converted to 2023 dollars using the cumulative US inflation rate.
Converting Numerical Column remote_ratio to Categorical: The remote_ratio column was converted into a categorical column with three categories: In office, Hybrid, and Remote.
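The cleaning steps above can be sketched roughly as follows. This is a minimal illustration, not the notebook's actual code: the toy rows, the 0.95 dominance threshold, the helper name drop_dominant_columns, the "US" residence code, the adjusted_salary and remote_work column names, and the inflation factors are all assumptions made for the example. In particular, the inflation factors are placeholders, not real figures.

```python
import pandas as pd

# Toy rows with column names from the Kaggle dataset; all values are made up.
raw = pd.DataFrame({
    "work_year": [2021, 2022, 2023, 2023],
    "employee_residence": ["US", "US", "US", "DE"],
    "salary_currency": ["USD", "USD", "USD", "EUR"],
    "salary": [100_000, 120_000, 150_000, 70_000],
    "remote_ratio": [0, 50, 100, 100],
    "employment_type": ["FT", "FT", "FT", "FT"],
})

# Placeholder cumulative factors to 2023 (hypothetical, not real inflation data).
INFLATION_TO_2023 = {2021: 1.10, 2022: 1.04, 2023: 1.00}

# Assumes remote_ratio only takes the values 0, 50, and 100.
REMOTE_LABELS = {0: "In office", 50: "Hybrid", 100: "Remote"}


def drop_dominant_columns(frame, threshold=0.95):
    """Drop columns where the most frequent value covers >= threshold of rows."""
    dominant = [
        col for col in frame.columns
        if frame[col].value_counts(normalize=True).iloc[0] >= threshold
    ]
    return frame.drop(columns=dominant)


# 1. Remove columns dominated by a single category (here: employment_type).
cleaned = drop_dominant_columns(raw)

# 2. Keep only US residents paid in USD.
is_us_usd = (cleaned["employee_residence"] == "US") & (cleaned["salary_currency"] == "USD")
cleaned = cleaned[is_us_usd].copy()

# 3. Express every salary in 2023 dollars.
cleaned["adjusted_salary"] = cleaned["salary"] * cleaned["work_year"].map(INFLATION_TO_2023)

# 4. Turn the numeric remote_ratio into a three-level categorical column.
cleaned["remote_work"] = cleaned["remote_ratio"].map(REMOTE_LABELS).astype("category")
```

Applying the dominance filter before the US/USD restriction follows the order of the steps listed above; after the restriction, employee_residence and salary_currency are constant and could be dropped as well.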

Exploratory Data Analysis (EDA)

Steps

Distribution of Numerical Columns: Histograms were plotted for each numerical column.
Distribution of Categorical Columns: Bar charts were plotted for each categorical column.
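A minimal sketch of the two EDA steps, assuming a cleaned DataFrame with one numerical and two categorical columns. The toy values and the column name adjusted_salary are hypothetical; the notebook may lay out its plots differently.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; omit this line inside Jupyter
import matplotlib.pyplot as plt
import pandas as pd

# Toy stand-in for the cleaned dataset (values are made up).
df = pd.DataFrame({
    "adjusted_salary": [95_000, 110_000, 125_000, 140_000, 160_000, 180_000],
    "experience_level": ["EN", "MI", "MI", "SE", "SE", "SE"],
    "company_size": ["S", "M", "M", "M", "L", "L"],
})

numeric_cols = df.select_dtypes(include="number").columns
categorical_cols = df.select_dtypes(exclude="number").columns
all_cols = list(numeric_cols) + list(categorical_cols)

# One panel per column; squeeze=False keeps `axes` two-dimensional.
fig, axes = plt.subplots(1, len(all_cols), figsize=(4 * len(all_cols), 3), squeeze=False)

for ax, col in zip(axes[0], all_cols):
    if col in numeric_cols:
        # Histogram for a numerical column.
        ax.hist(df[col], bins=5)
    else:
        # Bar chart of category counts for a categorical column.
        counts = df[col].value_counts()
        ax.bar(counts.index.astype(str), counts.to_numpy())
    ax.set_title(col)

fig.tight_layout()
```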

Hypothesis Testing

Hypotheses and Tests

Experience Level vs. Salary: Mood’s Median Test was used to compare salaries across different experience levels.
Remote Ratio vs. Salary: Mood’s Median Test was used to compare salaries across the remote work categories.
Company Size vs. Salary: Kruskal-Wallis Test was used to compare salaries across different company sizes.
Job Title vs. Remote Ratio: Chi-square Test of Independence was used to determine the association between job title and remote work ratio.
Job Title vs. Salary: Kruskal-Wallis Test was used to compare salaries across different job titles.
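The tests listed above are all available in SciPy. The sketch below shows one way they could be run; it uses randomly generated toy data, and the column names adjusted_salary and remote_work as well as the helper salary_groups are assumptions for illustration, not names confirmed by the notebook.

```python
import numpy as np
import pandas as pd
from scipy import stats

# Randomly generated toy data (seeded), standing in for the cleaned dataset.
rng = np.random.default_rng(0)
n = 120
df = pd.DataFrame({
    "experience_level": rng.choice(["EN", "MI", "SE"], size=n),
    "company_size": rng.choice(["S", "M", "L"], size=n),
    "job_title": rng.choice(["Data Scientist", "Data Engineer"], size=n),
    "remote_work": rng.choice(["In office", "Hybrid", "Remote"], size=n),
    "adjusted_salary": rng.normal(130_000, 30_000, size=n),
})


def salary_groups(frame, by):
    """Return one array of salaries per category of column `by`."""
    return [group["adjusted_salary"].to_numpy() for _, group in frame.groupby(by)]


# Mood's median test: do the groups share a common median salary?
_, p_experience, _, _ = stats.median_test(*salary_groups(df, "experience_level"))
_, p_remote, _, _ = stats.median_test(*salary_groups(df, "remote_work"))

# Kruskal-Wallis test: do the groups come from the same salary distribution?
p_company = stats.kruskal(*salary_groups(df, "company_size")).pvalue
p_title = stats.kruskal(*salary_groups(df, "job_title")).pvalue

# Chi-square test of independence on the job title x remote work table.
contingency = pd.crosstab(df["job_title"], df["remote_work"])
_, p_title_remote, dof, _ = stats.chi2_contingency(contingency)

p_values = {
    "experience_level vs salary": p_experience,
    "remote_work vs salary": p_remote,
    "company_size vs salary": p_company,
    "job_title vs salary": p_title,
    "job_title vs remote_work": p_title_remote,
}
```

Each p-value would then be compared against a chosen significance level such as 0.05. Because the toy data are random, these p-values carry no meaning beyond showing the mechanics.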

Linear Regression Analysis

Steps

Encoding Categorical Variables: Categorical variables were encoded for numerical representation.
Splitting Dataset: The dataset was split into training and testing subsets.
Training Model: A linear regression model was trained to predict adjusted salaries.
Evaluating Model: The model's performance was evaluated using Mean Squared Error (MSE) and R-squared metrics.
Visualizing Results: The impact of different features on predicted salaries was visualized.
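The modeling steps above might look like the following in scikit-learn. This is a sketch under stated assumptions: the README does not say which encoding was used, so one-hot encoding via pandas.get_dummies is an assumption, as are the 80/20 split, the column names, and the synthetic data.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic data: salary depends on experience level plus noise (made-up numbers).
rng = np.random.default_rng(42)
n = 200
levels = rng.choice(["EN", "MI", "SE"], size=n)
sizes = rng.choice(["S", "M", "L"], size=n)
base = pd.Series(levels).map({"EN": 80_000, "MI": 110_000, "SE": 150_000}).to_numpy()
df = pd.DataFrame({
    "experience_level": levels,
    "company_size": sizes,
    "adjusted_salary": base + rng.normal(0, 10_000, size=n),
})

# Encode categorical variables (one-hot, dropping the first level as baseline).
X = pd.get_dummies(df[["experience_level", "company_size"]], drop_first=True, dtype=float)
y = df["adjusted_salary"]

# Split into training and testing subsets.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# Train the linear regression model.
model = LinearRegression().fit(X_train, y_train)

# Evaluate on the held-out subset.
predictions = model.predict(X_test)
mse = mean_squared_error(y_test, predictions)
r2 = r2_score(y_test, predictions)

# Coefficients per encoded feature, sorted for a bar plot of feature impact.
coefficients = pd.Series(model.coef_, index=X.columns).sort_values()
```

With one-hot encoding, each coefficient reads as the estimated salary difference relative to the dropped baseline category, which makes a sorted bar plot of the coefficients a natural way to show feature impact.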

Requirements

Python 3.12
Jupyter Notebook
Pandas
Matplotlib
Seaborn
Scipy
Scikit-learn
Joblib
Scikit-posthocs

Running the Analysis

Clone the repository.
Install the required packages: pip install -r requirements.txt
Open analysis.ipynb in Jupyter Notebook.
Run the notebook cells sequentially to perform the analysis.

Conclusion

This project analyzes data scientists' salaries through data cleaning, EDA, hypothesis testing, and linear regression modeling. The findings show how factors such as experience level, company size, job title, and remote work arrangement relate to salary.

License

This project is licensed under the MIT License. See the LICENSE file for more details.


geoburdin/DataScientistSalary
