The following ML checklist has been followed in this project:
Understand the problem statement
Import the required libraries
Fetch and read the required data set
Explore the data
Split the dataset using Stratified Splits [important when dealing with a small dataset]
Stratified Splits: Split the dataset on a specified column so that the train and test sets have approximately the same distribution of that column. Do this before any substantial visualization, so that patterns seen in the test data cannot bias later modelling decisions.
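from sklearn.model_selection import train_test_split
# stratify= names the column whose distribution is preserved; it must be categorical (y is assumed to be here)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, stratify=y)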
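When the column to stratify on is continuous, it cannot be passed to stratify directly. One common workaround is to bin it into categories first. The sketch below is only an illustration under assumptions: the data is synthetic, and the column names (median_income, price, income_cat) and bin edges are hypothetical, not taken from this project's dataset.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical synthetic data standing in for the real dataset.
rng = np.random.default_rng(42)
df = pd.DataFrame({
    "median_income": rng.lognormal(mean=1.0, sigma=0.5, size=1000),
    "price": rng.normal(loc=200_000, scale=50_000, size=1000),
})

# Bin the continuous column into five categories (bin edges are an assumption).
df["income_cat"] = pd.cut(
    df["median_income"],
    bins=[0.0, 1.5, 3.0, 4.5, 6.0, np.inf],
    labels=[1, 2, 3, 4, 5],
)

# Stratify on the binned column so both sets keep its distribution.
train_set, test_set = train_test_split(
    df, test_size=0.33, stratify=df["income_cat"], random_state=42
)

# Compare category proportions in the full data and in the test set.
overall = df["income_cat"].value_counts(normalize=True).sort_index()
in_test = test_set["income_cat"].value_counts(normalize=True).sort_index()
max_gap = float((overall - in_test).abs().max())
print(pd.DataFrame({"overall": overall, "test": in_test}))
print("largest proportion gap:", max_gap)
```

The helper column income_cat exists only to drive the split and can be dropped from both sets afterwards.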
Visualize the data
Data visualization is one of the most important parts of the modelling process. Summary statistics alone often fail to reveal the details of a dataset, such as the shape of a distribution or the presence of outliers.
Using Histogram
foo.hist(bins=50, figsize=(20, 15))
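As a rough illustration of why histograms add information beyond summary statistics, the sketch below builds two synthetic columns with nearly the same mean and standard deviation but very different shapes. The DataFrame demo and its columns a and b are hypothetical and are not part of the project data.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 5000

# Column "a": a single bell-shaped cluster around 0.
a = rng.normal(0.0, 1.0, size=n)
# Column "b": two tight clusters around -1 and +1.
b = np.concatenate([
    rng.normal(-1.0, 0.1, size=n // 2),
    rng.normal(1.0, 0.1, size=n // 2),
])
demo = pd.DataFrame({"a": a, "b": b})

# The mean and standard deviation of the two columns are nearly identical.
summary = demo.agg(["mean", "std"])
print(summary.round(2))

# The histograms differ clearly: one peak for "a", two peaks for "b".
# Call plt.show() from matplotlib.pyplot to display the figure in a script.
axes = demo.hist(bins=50, figsize=(10, 4))
```

Here the summary table alone would suggest the two columns are similar, while the plots show that b has almost no values near zero.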
Using Correlation Matrices
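import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
# Pairwise correlations between the numeric columns of var
corr = var.corr()
# Mask the upper triangle, which mirrors the lower one
mask = np.triu(np.ones_like(corr, dtype=bool))
f, ax = plt.subplots(figsize=(11, 9))
cmap = sns.diverging_palette(230, 20, as_cmap=True)
# Correlations lie in [-1, 1], so the colour scale is capped at 0.3 rather than 3
sns.heatmap(corr, mask=mask, cmap=cmap, vmax=.3, center=0, square=True, linewidths=.5, cbar_kws={"shrink": .5})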
The project contains the test data set of a real estate company