The following ML checklist has been followed in this project:
-
Understand the problem statement
-
Import the required libraries
-
Fetch and read the required data set
-
Explore the data
-
Split the dataset using stratified splits (important when dealing with a small dataset)
Stratified splits: split the dataset on a specified column so that the train and test sets have approximately the same distribution of that column. Do this before any substantial visualization, so that patterns glimpsed in the test set cannot bias later modelling choices.
from sklearn.model_selection import train_test_split
# stratify=y assumes y is categorical; for a continuous target, stratify on a binned column instead
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, stratify=y)
X_train
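As a minimal sketch of the stratified split described above, the example below bins a continuous column into categories and stratifies on those bins. The data is synthetic, and the column names `income`, `price`, and `income_cat` are hypothetical placeholders, not taken from this project's dataset.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the real data (hypothetical column names).
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "income": rng.lognormal(mean=1.0, sigma=0.5, size=500),
    "price": rng.normal(loc=200.0, scale=50.0, size=500),
})

# Bin the continuous column into 5 equal-sized categories so it can be stratified on.
df["income_cat"] = pd.qcut(df["income"], q=5, labels=False)

# stratify= keeps the category proportions roughly equal in both splits.
train_set, test_set = train_test_split(
    df, test_size=0.33, stratify=df["income_cat"], random_state=42
)

# Compare the category proportions in the full data and in each split.
proportions = pd.DataFrame({
    "overall": df["income_cat"].value_counts(normalize=True),
    "train": train_set["income_cat"].value_counts(normalize=True),
    "test": test_set["income_cat"].value_counts(normalize=True),
}).sort_index()
print(proportions)
```

The three columns of `proportions` should be close to each other. With a plain random split on a small dataset they can drift apart noticeably.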
-
Visualize the data
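Data visualization is one of the most important parts of the modelling process. Summary statistics alone often fail to reveal the details of a dataset: two datasets with the same mean and standard deviation can look completely different when plotted.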
-
Using Histogram
# foo is a pandas DataFrame; this draws one histogram per numeric column
foo.hist(bins=50, figsize=(20, 15))
-
Using Correlation Matrices
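import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
corr = var.corr()
# Mask the upper triangle, which mirrors the lower one
mask = np.triu(np.ones_like(corr, dtype=bool))
f, ax = plt.subplots(figsize=(11, 9))
cmap = sns.diverging_palette(230, 20, as_cmap=True)
# vmax=1 because correlations lie in [-1, 1]
sns.heatmap(corr, mask=mask, cmap=cmap, vmax=1, center=0, square=True, linewidths=.5, cbar_kws={"shrink": .5})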
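To illustrate why summary statistics alone can fall short, the sketch below builds two synthetic samples with the same mean and standard deviation but different shapes. The data and the helper `standardize` are assumptions made for illustration and are not part of this project.

```python
import numpy as np

rng = np.random.default_rng(0)

def standardize(x):
    # Rescale to mean 0 and standard deviation 1 (up to floating-point error).
    return (x - x.mean()) / x.std()

# One bell-shaped sample and one sample with two separate peaks.
samples = {
    "unimodal": standardize(rng.normal(size=2000)),
    "bimodal": standardize(np.concatenate([
        rng.normal(loc=-3.0, size=1000),
        rng.normal(loc=3.0, size=1000),
    ])),
}

hist_counts = {}
for name, sample in samples.items():
    # Seven unit-width bins; the middle bin covers [-0.5, 0.5).
    counts, _ = np.histogram(sample, bins=7, range=(-3.5, 3.5))
    hist_counts[name] = counts
    print(f"{name}: mean={sample.mean():.2f} std={sample.std():.2f} counts={counts}")
```

Both samples report the same mean and standard deviation, yet the bin counts differ: the first peaks in the middle bin, while the second has a dip there. A histogram shows this at a glance.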
-
Note
The project contains the test data set of a real estate company