This is a solution notebook for an assignment question from a graduate Data Mining course. Each code block is accompanied by analysis where relevant.
Dataset link: https://archive.ics.uci.edu/ml/datasets/Dry+Bean+Dataset
Broadly, the following steps have been performed in this solution notebook:
- Plotted the class distribution of the dataset and analyzed it.
- Performed EDA (histograms, box plots, etc.) and drew insights from the data.
- Used the t-SNE algorithm to reduce the data to 2 dimensions and plotted the result as a scatter plot.
- This helps in observing the separability of the data.
- Ran the sklearn implementation of Gaussian Naive Bayes and Multinomial Naive Bayes.
- Reported Accuracy, Recall, and Precision, and analyzed the differences between the two Naive Bayes implementations using an 80:20 train-test split.
- Used Principal Component Analysis (PCA) to reduce the number of features and used the reduced dataset for model training.
- Retained different fractions of the variance, ranging from 0.9 to 1 in steps of 0.01.
- Compared the results using Accuracy, Precision, Recall and F1-score.
- Plotted ROC-AUC curves.
- Further trained the model using Multinomial Logistic Regression and compared the results with Naive Bayes.
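
The modelling steps above can be sketched roughly as follows. This is a minimal illustration rather than the notebook's actual code: it uses a synthetic dataset from `make_classification` as a stand-in for the Dry Bean data, and the scalers, `random_state` values, macro averaging, and the names `evaluate`, `baseline`, and `pca_sweep` are all assumptions made for the example. Two details are worth noting: `MultinomialNB` expects non-negative features, so the sketch rescales inputs to [0, 1] before fitting it, and scikit-learn's `PCA` accepts a variance fraction only strictly between 0 and 1, so the "retain everything" case is expressed as `n_components=None`.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB, MultinomialNB
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Assumption: synthetic multi-class data standing in for the real dataset.
X, y = make_classification(
    n_samples=2000, n_features=16, n_informative=10, n_redundant=4,
    n_classes=7, n_clusters_per_class=1, random_state=0,
)

# 80:20 train-test split, stratified so class proportions are preserved.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)


def evaluate(model):
    """Fit on the training split and return macro-averaged test metrics."""
    model.fit(X_train, y_train)
    pred = model.predict(X_test)
    return {
        "accuracy": accuracy_score(y_test, pred),
        "precision": precision_score(y_test, pred, average="macro", zero_division=0),
        "recall": recall_score(y_test, pred, average="macro", zero_division=0),
        "f1": f1_score(y_test, pred, average="macro", zero_division=0),
    }


models = {
    "gaussian_nb": make_pipeline(StandardScaler(), GaussianNB()),
    # MultinomialNB needs non-negative inputs, hence the [0, 1] rescaling.
    "multinomial_nb": make_pipeline(MinMaxScaler(), MultinomialNB()),
    # With the default lbfgs solver, recent scikit-learn versions fit a
    # multinomial (softmax) model for multi-class targets.
    "logistic_regression": make_pipeline(
        StandardScaler(), LogisticRegression(max_iter=1000)
    ),
}
baseline = {name: evaluate(model) for name, model in models.items()}

# Variance levels 0.90, 0.91, ..., 0.99, plus None to keep all components.
variance_levels = [round(0.90 + 0.01 * i, 2) for i in range(10)] + [None]
pca_sweep = []
for level in variance_levels:
    pipe = make_pipeline(StandardScaler(), PCA(n_components=level), GaussianNB())
    scores = evaluate(pipe)
    n_kept = pipe.named_steps["pca"].n_components_
    pca_sweep.append((1.0 if level is None else level, n_kept, scores))

for name, scores in baseline.items():
    print(name, {k: round(v, 3) for k, v in scores.items()})
for level, n_kept, scores in pca_sweep:
    print(f"variance={level:.2f} components={n_kept} f1={scores['f1']:.3f}")
```

Wrapping each scaler and model in a pipeline keeps the preprocessing fitted on the training split only, which avoids leaking test-set statistics into the comparison.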