Diabetologia. 2024 Nov;67(11):2446-2458.
doi: 10.1007/s00125-024-06248-8. Epub 2024 Aug 21.

Machine learning-based reproducible prediction of type 2 diabetes subtypes

Hayato Tanabe et al. Diabetologia. 2024 Nov.

Abstract

Aims/hypothesis: Clustering-based subclassification of type 2 diabetes, which reflects pathophysiology and genetic predisposition, is a promising approach for providing personalised and effective therapeutic strategies. Ahlqvist's classification is currently the most vigorously validated method because of its superior ability to predict diabetes complications but it does not have strong consistency over time and requires HOMA2 indices, which are not routinely available in clinical practice and standard cohort studies. We developed a machine learning (ML) model to classify individuals with type 2 diabetes into Ahlqvist's subtypes consistently over time.

Methods: The Cohort 1 dataset comprised 619 Japanese individuals with type 2 diabetes, who were divided into training and test sets for ML models in a 7:3 ratio. The Cohort 2 dataset, comprising 597 individuals with type 2 diabetes, was used for external validation. Participants were pre-labelled (T2Dkmeans) by unsupervised k-means clustering based on Ahlqvist's variables (age at diagnosis, BMI, HbA1c, HOMA2-B and HOMA2-IR) into four subtypes: severe insulin-deficient diabetes (SIDD), severe insulin-resistant diabetes (SIRD), mild obesity-related diabetes (MOD) and mild age-related diabetes (MARD). We adopted 15 variables for a multiclass classification random forest (RF) algorithm to predict type 2 diabetes subtypes (T2DRF15). The proximity matrix computed by RF was visualised using uniform manifold approximation and projection (UMAP). Finally, we used a putative subset with missing insulin-related variables to test predictive performance in the validation cohort, the consistency of subtypes over time and the ability to predict diabetes complications.
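The pre-labelling and proximity steps described in the Methods can be illustrated with a minimal, standard-library-only sketch. It is not the authors' implementation: the toy values (BMI, HbA1c), the farthest-point initialisation, the function names and the hand-written leaf table are all assumptions made for illustration. The sketch standardises the variables, runs Lloyd's k-means, and computes an RF-style proximity as the fraction of trees in which two participants fall into the same leaf.

```python
import math
from statistics import mean, pstdev


def standardise(rows):
    """Scale each column to mean 0 and SD 1 so no variable dominates the distance."""
    cols = list(zip(*rows))
    mus = [mean(c) for c in cols]
    sds = [pstdev(c) or 1.0 for c in cols]  # guard against constant columns
    return [[(v - m) / s for v, m, s in zip(row, mus, sds)] for row in rows]


def kmeans(points, k, n_iter=100):
    """Lloyd's algorithm with a deterministic farthest-point start (an assumption)."""
    centroids = [list(points[0])]
    while len(centroids) < k:
        far = max(points, key=lambda p: min(math.dist(p, c) for c in centroids))
        centroids.append(list(far))
    labels = None
    for _ in range(n_iter):
        new_labels = [
            min(range(k), key=lambda j: math.dist(p, centroids[j])) for p in points
        ]
        if new_labels == labels:  # assignments stable: converged
            break
        labels = new_labels
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                centroids[j] = [mean(col) for col in zip(*members)]
    return labels


def proximity_matrix(leaf_ids):
    """leaf_ids[i][t] is the leaf reached by participant i in tree t.

    Proximity of i and j = share of trees in which both land in the same leaf.
    """
    n, n_trees = len(leaf_ids), len(leaf_ids[0])
    return [
        [sum(a == b for a, b in zip(leaf_ids[i], leaf_ids[j])) / n_trees
         for j in range(n)]
        for i in range(n)
    ]


# Hypothetical toy data: (BMI in kg/m2, HbA1c in %), two obvious groups.
toy = [(22.0, 9.5), (23.0, 9.8), (21.5, 9.6), (33.0, 6.5), (34.0, 6.7), (32.5, 6.4)]
toy_labels = kmeans(standardise(toy), k=2)

# Hypothetical leaf indices for 3 participants in a 3-tree forest.
toy_leaves = [[1, 4, 2], [1, 4, 3], [2, 5, 3]]
toy_proximity = proximity_matrix(toy_leaves)
```

In the study itself k-means was run on five variables with four clusters; the toy call uses two variables and k=2 only to keep the example readable. A dissimilarity such as 1 minus proximity could then be passed to a dimensionality-reduction method for plotting.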

Results: T2DRF15 achieved 94% accuracy in predicting T2Dkmeans type 2 diabetes subtypes (AUCs ≥0.99 and F1 score [the harmonic mean of precision and recall] ≥0.9) and retained its predictive performance in the external validation cohort (86.3%). When used with an imputation algorithm, T2DRF15 also showed an accuracy of 82.9% for detecting T2Dkmeans in a putative subset with missing insulin-related variables. In Kaplan-Meier analysis, the T2DRF15 clusters showed distinct cumulative risks of diabetic retinopathy in SIDD and of chronic kidney disease in SIRD during a median observation period of 11.6 (4.5-18.3) years, similar to the T2Dkmeans subtypes. Predictive accuracy improved after excluding individuals with low predictive probability, who were categorised as an 'undecidable' cluster. After excluding undecidable individuals, T2DRF15 showed higher consistency (100% for SIDD, 68.6% for SIRD, 94.4% for MOD and 97.9% for MARD) than T2Dkmeans.
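Three quantities in the Results lend themselves to a short illustration: the F1 score, the 'undecidable' assignment and the per-subtype consistency. The sketch below is an assumption-laden reading of the abstract, not the authors' code. The abstract does not state the probability cut-off, so `threshold=0.6` is hypothetical, and consistency is assumed here to mean the share of individuals in a baseline subtype who keep the same label at follow-up, ignoring anyone undecidable at either visit.

```python
from collections import Counter

SUBTYPES = ("SIDD", "SIRD", "MOD", "MARD")


def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def assign_subtype(probabilities, threshold=0.6):
    """Return the most probable subtype, or 'undecidable' below the cut-off.

    `threshold` is a hypothetical value; the abstract does not report one.
    """
    best = max(SUBTYPES, key=lambda s: probabilities[s])
    return best if probabilities[best] >= threshold else "undecidable"


def consistency(baseline, follow_up):
    """Per-subtype share of individuals whose label is unchanged at follow-up.

    Individuals labelled 'undecidable' at either visit are skipped (assumption).
    """
    kept, total = Counter(), Counter()
    for before, after in zip(baseline, follow_up):
        if "undecidable" in (before, after):
            continue
        total[before] += 1
        kept[before] += before == after
    return {s: kept[s] / total[s] for s in SUBTYPES if total[s]}


# Hypothetical labels for seven individuals at baseline and follow-up.
baseline = ["SIDD", "SIDD", "SIRD", "SIRD", "MOD", "MARD", "undecidable"]
follow_up = ["SIDD", "SIDD", "SIRD", "MARD", "MOD", "MARD", "MARD"]
toy_consistency = consistency(baseline, follow_up)
```

With these toy labels, one of the two SIRD individuals migrates to MARD, so SIRD consistency is 0.5 while the other subtypes stay at 1.0.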

Conclusions/interpretation: The new ML model for predicting Ahlqvist's subtypes of type 2 diabetes has great potential for application in clinical practice and cohort studies because it can classify individuals with missing HOMA2 indices and predict glycaemic control, diabetic complications and treatment outcomes with long-term consistency by using readily available variables. Future studies are needed to assess whether our approach is applicable to research and/or clinical practice in multiethnic populations.

Keywords: Clustering; Diabetes subtypes; Machine learning; Random forest; Type 2 diabetes.

Figures

Fig. 1
Predictive performance of type 2 diabetes subtypes using an RF algorithm based on 15 features (T2DRF15) for estimating T2Dkmeans in the test dataset of Cohort 1. (a) ROC curve showing the diagnostic performance of T2DRF15, the RF model using the 15 Boruta-selected features, to predict T2Dkmeans. (b) Feature importance of the 15 Boruta-selected variables fed into T2DRF15. ALT, alanine aminotransferase; FPG, fasting plasma glucose; γGT, γ-glutamyl transferase; HDL-C, HDL-cholesterol; TG, triacylglycerols
Fig. 2
Predictive performance of type 2 diabetes subtypes using an RF algorithm based on 15 features (T2DRF15) for estimating T2Dkmeans in the external validation dataset of Cohort 2. ROC curves showing the diagnostic ability of T2DRF15 to predict the subtypes pre-labelled by k-means clustering (T2Dkmeans) were calculated in the original Cohort 2 dataset (a) and in a putative Cohort 2 dataset with missing insulin-related variables (b)
Fig. 3
Proximity matrix representing the similarity between participants calculated using the RF. (a) Two-dimensional visualisation of the proximity matrix between all participants included in the training and test data. Colours indicate differences in subtype assignment using k-means clustering (T2Dkmeans). (b) Proximity matrix with embedded labels for type 2 diabetes subtypes predicted by the RF algorithm based on 15 variables (T2DRF15). Participants with low predictive probability were newly defined as the undecidable cluster. (c) Proximity matrix with T2DRF15 labels embedded after excluding participants in the undecidable cluster; the remaining participants could be clearly divided into four clusters
Fig. 4
Sankey diagram showing the subtype redistribution and migration pattern of the study participants in Cohort 1 from baseline to 5 year follow-up. (a) Type 2 diabetes subtypes labelled by k-means clustering (T2Dkmeans). (b) Type 2 diabetes subtypes predicted by an RF algorithm based on 15 variables (T2DRF15). (c) Migration pattern of T2DRF15 excluding the undecidable cluster. (d) Type 2 diabetes subtypes predicted by an RF algorithm based on 15 variables from the dataset where insulin-related variables have been imputed (T2DRF15)
Fig. 5
Kaplan–Meier curves for the cumulative incidence of retinopathy (a), CKD (eGFR <60 ml/min per 1.73 m²) (b), proteinuria (c) and coronary artery disease (d) in type 2 diabetes subtypes predicted by RF based on 15 variables (T2DRF15) in the putative dataset in Cohort 1 with missing insulin-related variables
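For readers unfamiliar with the Kaplan–Meier curves referred to in Fig. 5, the following is a minimal sketch of the product-limit estimator and the corresponding cumulative incidence (1 minus the event-free probability). The follow-up times and event flags are invented for illustration, the function names are hypothetical, and the study's actual analysis software is not stated in the abstract.

```python
def kaplan_meier(times, events):
    """Product-limit estimate of event-free probability.

    times:  follow-up time for each individual (e.g. years)
    events: 1 if the complication was observed, 0 if censored
    Returns [(time, event_free_probability)] at each distinct event time.
    """
    data = sorted(zip(times, events))
    at_risk = len(data)
    survival = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        observed = 0  # events occurring exactly at time t
        removed = 0   # events plus censorings at time t
        while i < len(data) and data[i][0] == t:
            observed += data[i][1]
            removed += 1
            i += 1
        if observed:
            survival *= 1 - observed / at_risk
            curve.append((t, survival))
        at_risk -= removed  # everyone seen at t leaves the risk set afterwards
    return curve


def cumulative_incidence(curve):
    """Convert event-free probabilities to cumulative incidence."""
    return [(t, 1 - s) for t, s in curve]


# Hypothetical follow-up for five individuals in one subtype.
toy_times = [2, 3, 3, 5, 8]
toy_events = [1, 0, 1, 1, 0]
toy_curve = kaplan_meier(toy_times, toy_events)
toy_incidence = cumulative_incidence(toy_curve)
```

In this toy example the event-free probability falls to 0.8 at year 2, 0.6 at year 3 and 0.3 at year 5, so the cumulative incidence rises to 0.2, 0.4 and 0.7. Running the estimator separately for each predicted subtype would give one curve per cluster, as in the figure.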
