Skip to content

inderpalk/diabetes_data-analysis

Repository files navigation

Diabetes Prediction Project

Overview

This project is a supervised machine learning project focused on predicting the likelihood of diabetes in individuals based on certain diagnostic measurements. The dataset used for this project is originally from the National Institute of Diabetes and Digestive and Kidney Diseases.

Table of Contents

  1. Project Description
  2. Dataset
  3. Exploratory Data Analysis (EDA)
  4. Data Preprocessing
  5. Machine Learning Models
  6. Evaluation
  7. Findings
  8. Next Steps

Project Description

In this project, we aim to build a binary classification model to predict whether a patient has diabetes or not. We have explored the dataset, performed data preprocessing, developed machine learning models, and evaluated their performance using various metrics.

Dataset

  • Source: Kaggle Diabetes Dataset
  • Description: The dataset contains several diagnostic measurements such as glucose level, BMI, blood pressure, etc., along with the target variable indicating the presence or absence of diabetes.

Exploratory Data Analysis (EDA)

  • Conducted exploratory data analysis to gain insights into the dataset.
  • Explored missing values, data distributions, correlations, and potential outliers.

Data Preprocessing

  • Handled missing values by imputing with mean values.
  • Scaled and normalized numeric columns.
  • Performed feature engineering and addressed imbalanced data.

Machine Learning Models

  • Developed two models: Logistic Regression and Random Forest.
  • Tuned hyperparameters using grid search or randomized search.
  • Utilized cross-validation to assess model generalization.

Evaluation

  • Evaluated model performance using metrics such as accuracy, precision, recall, F1-score, and ROC-AUC.
  • Compared the performance of Logistic Regression and Random Forest models.

Findings

  • Logistic Regression generally outperformed Random Forest in terms of accuracy, precision, and F1-score.
  • Cross-validation results indicated stable performance for both models.
  • Feature importance analysis was used, primarily for Random Forest.

Next Steps

  • Further hyperparameter tuning and feature selection.
  • Consideration of model interpretability and domain-specific requirements.
  • Deployment of the chosen model for real-world predictions.

About

Supervised Learning Project of Diabetes API

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published