Skip to content

Machine Learning Exercise: Exploring categorical plots, LabelEncoder, pipelines and GridSearchCV using Telco Customer Churn data from Kaggle

Notifications You must be signed in to change notification settings

nicolelumagui/ML-Exercise_Telco-Customer-Churn_Classification

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

4 Commits
 
 
 
 
 
 

Repository files navigation

MACHINE LEARNING EXERCISE: CLASSIFICATION

TELCO CUSTOMER CHURN

Models

  • Logistic Regression
  • K-Nearest Neighbors
  • Naive Bayes
  • Decision Trees
  • Random Forest Classifier
  • XGBoost

About

  • "Predict behavior to retain customers. You can analyze all relevant customer data and develop focused customer retention programs." [IBM Sample Data Sets]

Target Variable

  • Churn (Yes / No)

Features

  1. gender
  2. SeniorCitizen
  3. Partner
  4. Dependents
  5. tenure
  6. PhoneService
  7. MultipleLines
  8. InternetService
  9. OnlineSecurity
  10. OnlineBackup
  11. DeviceProtection
  12. TechSupport
  13. StreamingTV
  14. StreamingMovies
  15. Contract
  16. PaperlessBilling
  17. PaymentMethod
  18. MonthlyCharges
  19. TotalCharges

Source

Import Libraries

##### Standard Libraries #####
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

sns.set_style("whitegrid")
sns.set_context("poster")

%matplotlib inline
##### Other Libraries #####

## ML Algorithms ##
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier

## To visualize decision tree ##
from sklearn.externals.six import StringIO
from IPython.display import Image
from sklearn.tree import export_graphviz
import pydotplus

## For building models ##
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.utils import resample
from sklearn.pipeline import make_pipeline

## For measuring performance ##
from sklearn import metrics
from sklearn.model_selection import cross_val_score, cross_validate 

## Ignore warnings ##
import warnings
warnings.filterwarnings('ignore')

Load the Dataset

First of all, let us load the dataset then check if it is properly loaded by showing a snippet of the data and checking its columns.

### Load the data
df = pd.read_csv("WA_Fn-UseC_-Telco-Customer-Churn.csv", index_col="customerID")

### Check if the data is properly loaded
print("Size of the dataset:", df.shape)
df.head()
Size of the dataset: (7043, 20)
gender SeniorCitizen Partner Dependents tenure PhoneService MultipleLines InternetService OnlineSecurity OnlineBackup DeviceProtection TechSupport StreamingTV StreamingMovies Contract PaperlessBilling PaymentMethod MonthlyCharges TotalCharges Churn
customerID
7590-VHVEG Female 0 Yes No 1 No No phone service DSL No Yes No No No No Month-to-month Yes Electronic check 29.85 29.85 No
5575-GNVDE Male 0 No No 34 Yes No DSL Yes No Yes No No No One year No Mailed check 56.95 1889.5 No
3668-QPYBK Male 0 No No 2 Yes No DSL Yes Yes No No No No Month-to-month Yes Mailed check 53.85 108.15 Yes
7795-CFOCW Male 0 No No 45 No No phone service DSL Yes No Yes Yes No No One year No Bank transfer (automatic) 42.30 1840.75 No
9237-HQITU Female 0 No No 2 Yes No Fiber optic No No No No No No Month-to-month Yes Electronic check 70.70 151.65 Yes

From the data snippet above, majority of the columns have data in words/strings. Only tenure, MonthlyCharges and TotalCharges are the columns containing continuous numbers. While the SeniorCitizen also contains numbers, it only has two distinct values which are 0 and 1.

Below is the list of columns of this dataset, along with the datatypes of the columns.

### List the columns along with its type
df.info()
<class 'pandas.core.frame.DataFrame'>
Index: 7043 entries, 7590-VHVEG to 3186-AJIEK
Data columns (total 20 columns):
gender              7043 non-null object
SeniorCitizen       7043 non-null int64
Partner             7043 non-null object
Dependents          7043 non-null object
tenure              7043 non-null int64
PhoneService        7043 non-null object
MultipleLines       7043 non-null object
InternetService     7043 non-null object
OnlineSecurity      7043 non-null object
OnlineBackup        7043 non-null object
DeviceProtection    7043 non-null object
TechSupport         7043 non-null object
StreamingTV         7043 non-null object
StreamingMovies     7043 non-null object
Contract            7043 non-null object
PaperlessBilling    7043 non-null object
PaymentMethod       7043 non-null object
MonthlyCharges      7043 non-null float64
TotalCharges        7043 non-null object
Churn               7043 non-null object
dtypes: float64(1), int64(2), object(17)
memory usage: 660.3+ KB

Columns with data in Strings are considered object while columns with numerical data are either int or float.

The only numerical columns should be SeniorCitizen, tenure, MonthlyCharges and TotalCharges. But the column type ofTotalCharges is object, so we shall inspect it further.

Explore the Dataset

Let us look at the summary of statistics of the data. For now, the numerical columns will only be displayed.

### Summary of statistics for numerical data
df.describe()
SeniorCitizen tenure MonthlyCharges
count 7043.000000 7043.000000 7043.000000
mean 0.162147 32.371149 64.761692
std 0.368612 24.559481 30.090047
min 0.000000 0.000000 18.250000
25% 0.000000 9.000000 35.500000
50% 0.000000 29.000000 70.350000
75% 0.000000 55.000000 89.850000
max 1.000000 72.000000 118.750000

Notice that the minimum value at tenure column is 0, which may mean that that customer just started availing the service.

Now, let us look at the summary of categorical columns.

### Summary of statistics for categorical data
df.describe(include="O")
gender Partner Dependents PhoneService MultipleLines InternetService OnlineSecurity OnlineBackup DeviceProtection TechSupport StreamingTV StreamingMovies Contract PaperlessBilling PaymentMethod TotalCharges Churn
count 7043 7043 7043 7043 7043 7043 7043 7043 7043 7043 7043 7043 7043 7043 7043 7043 7043
unique 2 2 2 2 3 3 3 3 3 3 3 3 3 2 4 6531 2
top Male No No Yes No Fiber optic No No No No No No Month-to-month Yes Electronic check No
freq 3555 3641 4933 6361 3390 3096 3498 3088 3095 3473 2810 2785 3875 4171 2365 11 5174

From the table above, we can see the number of distinct values and the most common value per categorical column.

If we look at TotalCharges column, the top values is " " or space. It has 6531 distinct values out of 7043 rows which is very unideal for a categorical column.

Let us list the distinct values of TotalCharges column.

### Find the string in TotalCharges column
df["TotalCharges"].value_counts()[:10]
         11
20.2     11
19.75     9
19.9      8
19.65     8
20.05     8
45.3      7
19.55     7
20.15     6
19.45     6
Name: TotalCharges, dtype: int64

Looks like the " "/space values in the TotalCharges column cause problems in parsing this column into numerical.

Below are the 11 rows containing TotalCharges equals to space.

### Inspect the rows with TotalCharges==" "
df[ df["TotalCharges"]==" " ]
gender SeniorCitizen Partner Dependents tenure PhoneService MultipleLines InternetService OnlineSecurity OnlineBackup DeviceProtection TechSupport StreamingTV StreamingMovies Contract PaperlessBilling PaymentMethod MonthlyCharges TotalCharges Churn
customerID
4472-LVYGI Female 0 Yes Yes 0 No No phone service DSL Yes No Yes Yes Yes No Two year Yes Bank transfer (automatic) 52.55 No
3115-CZMZD Male 0 No Yes 0 Yes No No No internet service No internet service No internet service No internet service No internet service No internet service Two year No Mailed check 20.25 No
5709-LVOEQ Female 0 Yes Yes 0 Yes No DSL Yes Yes Yes No Yes Yes Two year No Mailed check 80.85 No
4367-NUYAO Male 0 Yes Yes 0 Yes Yes No No internet service No internet service No internet service No internet service No internet service No internet service Two year No Mailed check 25.75 No
1371-DWPAZ Female 0 Yes Yes 0 No No phone service DSL Yes Yes Yes Yes Yes No Two year No Credit card (automatic) 56.05 No
7644-OMVMY Male 0 Yes Yes 0 Yes No No No internet service No internet service No internet service No internet service No internet service No internet service Two year No Mailed check 19.85 No
3213-VVOLG Male 0 Yes Yes 0 Yes Yes No No internet service No internet service No internet service No internet service No internet service No internet service Two year No Mailed check 25.35 No
2520-SGTTA Female 0 Yes Yes 0 Yes No No No internet service No internet service No internet service No internet service No internet service No internet service Two year No Mailed check 20.00 No
2923-ARZLG Male 0 Yes Yes 0 Yes No No No internet service No internet service No internet service No internet service No internet service No internet service One year Yes Mailed check 19.70 No
4075-WKNIU Female 0 Yes Yes 0 Yes Yes DSL No Yes Yes Yes Yes No Two year No Mailed check 73.35 No
2775-SEFEE Male 0 No Yes 0 Yes Yes DSL Yes Yes No Yes No No Two year Yes Bank transfer (automatic) 61.90 No

We can observed from the table above that the rows where TotalCharges == " " did not churn, but most notably, also have tenure == 0. So, let's inpect further rows wherein tenure == 0.

### Display the number of zero values 
print("---Count zero values---")
print("Tenure: {}".format( df["tenure"].value_counts()[0] ))

df[ df["tenure"]==0 ]
---Count zero values---
Tenure: 11
gender SeniorCitizen Partner Dependents tenure PhoneService MultipleLines InternetService OnlineSecurity OnlineBackup DeviceProtection TechSupport StreamingTV StreamingMovies Contract PaperlessBilling PaymentMethod MonthlyCharges TotalCharges Churn
customerID
4472-LVYGI Female 0 Yes Yes 0 No No phone service DSL Yes No Yes Yes Yes No Two year Yes Bank transfer (automatic) 52.55 No
3115-CZMZD Male 0 No Yes 0 Yes No No No internet service No internet service No internet service No internet service No internet service No internet service Two year No Mailed check 20.25 No
5709-LVOEQ Female 0 Yes Yes 0 Yes No DSL Yes Yes Yes No Yes Yes Two year No Mailed check 80.85 No
4367-NUYAO Male 0 Yes Yes 0 Yes Yes No No internet service No internet service No internet service No internet service No internet service No internet service Two year No Mailed check 25.75 No
1371-DWPAZ Female 0 Yes Yes 0 No No phone service DSL Yes Yes Yes Yes Yes No Two year No Credit card (automatic) 56.05 No
7644-OMVMY Male 0 Yes Yes 0 Yes No No No internet service No internet service No internet service No internet service No internet service No internet service Two year No Mailed check 19.85 No
3213-VVOLG Male 0 Yes Yes 0 Yes Yes No No internet service No internet service No internet service No internet service No internet service No internet service Two year No Mailed check 25.35 No
2520-SGTTA Female 0 Yes Yes 0 Yes No No No internet service No internet service No internet service No internet service No internet service No internet service Two year No Mailed check 20.00 No
2923-ARZLG Male 0 Yes Yes 0 Yes No No No internet service No internet service No internet service No internet service No internet service No internet service One year Yes Mailed check 19.70 No
4075-WKNIU Female 0 Yes Yes 0 Yes Yes DSL No Yes Yes Yes Yes No Two year No Mailed check 73.35 No
2775-SEFEE Male 0 No Yes 0 Yes Yes DSL Yes Yes No Yes No No Two year Yes Bank transfer (automatic) 61.90 No

This proves that the rows that have TotalCharges == " " also have tenure == 0. This means that these customers just availed the service and have not been regularly charged for the service.

Let us remove these rows since it has incomplete data and may cause problems for modelling.

### Remove rows where tenure = 0 and TotalCharges is not numerical 
df1 = df.drop(df[ df["tenure"]==0 ].index, axis=0)

### Convert TotalCharges column from String to float
df1["TotalCharges"] = df1["TotalCharges"].astype(float)

### Check df1 to see if transformations are successful
df1.describe()
SeniorCitizen tenure MonthlyCharges TotalCharges
count 7032.000000 7032.000000 7032.000000 7032.000000
mean 0.162400 32.421786 64.798208 2283.300441
std 0.368844 24.545260 30.085974 2266.771362
min 0.000000 1.000000 18.250000 18.800000
25% 0.000000 9.000000 35.587500 401.450000
50% 0.000000 29.000000 70.350000 1397.475000
75% 0.000000 55.000000 89.862500 3794.737500
max 1.000000 72.000000 118.750000 8684.800000

Now that the problems with tenure and TotalCharges are fixed, we can further clean the data to be machine readable.

Let's convert those categories or strings per column into numerical labels using LabelEncoder. Listed below are the categories per column along with its transformation into numerical labels.

df_clean = df1
le = {}

print("-----Value for Categorical Columns-----")
for col in df_clean.columns:
    if not col in ["SeniorCitizen", "tenure", "MonthlyCharges", "TotalCharges"]:
        print( "--{}--\n{}".format(col, df_clean[col].value_counts()) )
        le[col] = LabelEncoder()
        df_clean[col] = le[col].fit_transform(df_clean[col])
        
        print( "Transformed:\n{}\n".format(df_clean[col].value_counts()) )
-----Value for Categorical Columns-----
--gender--
Male      3549
Female    3483
Name: gender, dtype: int64
Transformed:
1    3549
0    3483
Name: gender, dtype: int64

--Partner--
No     3639
Yes    3393
Name: Partner, dtype: int64
Transformed:
0    3639
1    3393
Name: Partner, dtype: int64

--Dependents--
No     4933
Yes    2099
Name: Dependents, dtype: int64
Transformed:
0    4933
1    2099
Name: Dependents, dtype: int64

--PhoneService--
Yes    6352
No      680
Name: PhoneService, dtype: int64
Transformed:
1    6352
0     680
Name: PhoneService, dtype: int64

--MultipleLines--
No                  3385
Yes                 2967
No phone service     680
Name: MultipleLines, dtype: int64
Transformed:
0    3385
2    2967
1     680
Name: MultipleLines, dtype: int64

--InternetService--
Fiber optic    3096
DSL            2416
No             1520
Name: InternetService, dtype: int64
Transformed:
1    3096
0    2416
2    1520
Name: InternetService, dtype: int64

--OnlineSecurity--
No                     3497
Yes                    2015
No internet service    1520
Name: OnlineSecurity, dtype: int64
Transformed:
0    3497
2    2015
1    1520
Name: OnlineSecurity, dtype: int64

--OnlineBackup--
No                     3087
Yes                    2425
No internet service    1520
Name: OnlineBackup, dtype: int64
Transformed:
0    3087
2    2425
1    1520
Name: OnlineBackup, dtype: int64

--DeviceProtection--
No                     3094
Yes                    2418
No internet service    1520
Name: DeviceProtection, dtype: int64
Transformed:
0    3094
2    2418
1    1520
Name: DeviceProtection, dtype: int64

--TechSupport--
No                     3472
Yes                    2040
No internet service    1520
Name: TechSupport, dtype: int64
Transformed:
0    3472
2    2040
1    1520
Name: TechSupport, dtype: int64

--StreamingTV--
No                     2809
Yes                    2703
No internet service    1520
Name: StreamingTV, dtype: int64
Transformed:
0    2809
2    2703
1    1520
Name: StreamingTV, dtype: int64

--StreamingMovies--
No                     2781
Yes                    2731
No internet service    1520
Name: StreamingMovies, dtype: int64
Transformed:
0    2781
2    2731
1    1520
Name: StreamingMovies, dtype: int64

--Contract--
Month-to-month    3875
Two year          1685
One year          1472
Name: Contract, dtype: int64
Transformed:
0    3875
2    1685
1    1472
Name: Contract, dtype: int64

--PaperlessBilling--
Yes    4168
No     2864
Name: PaperlessBilling, dtype: int64
Transformed:
1    4168
0    2864
Name: PaperlessBilling, dtype: int64

--PaymentMethod--
Electronic check             2365
Mailed check                 1604
Bank transfer (automatic)    1542
Credit card (automatic)      1521
Name: PaymentMethod, dtype: int64
Transformed:
2    2365
3    1604
0    1542
1    1521
Name: PaymentMethod, dtype: int64

--Churn--
No     5163
Yes    1869
Name: Churn, dtype: int64
Transformed:
0    5163
1    1869
Name: Churn, dtype: int64
df_clean.describe()
gender SeniorCitizen Partner Dependents tenure PhoneService MultipleLines InternetService OnlineSecurity OnlineBackup DeviceProtection TechSupport StreamingTV StreamingMovies Contract PaperlessBilling PaymentMethod MonthlyCharges TotalCharges Churn
count 7032.000000 7032.000000 7032.000000 7032.000000 7032.000000 7032.000000 7032.000000 7032.000000 7032.000000 7032.000000 7032.000000 7032.000000 7032.000000 7032.000000 7032.000000 7032.000000 7032.000000 7032.000000 7032.000000 7032.000000
mean 0.504693 0.162400 0.482509 0.298493 32.421786 0.903299 0.940557 0.872582 0.789249 0.905859 0.903868 0.796359 0.984926 0.992890 0.688567 0.592719 1.573237 64.798208 2283.300441 0.265785
std 0.500014 0.368844 0.499729 0.457629 24.545260 0.295571 0.948627 0.737271 0.859962 0.880394 0.880178 0.861674 0.885285 0.885385 0.832934 0.491363 1.067504 30.085974 2266.771362 0.441782
min 0.000000 0.000000 0.000000 0.000000 1.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 18.250000 18.800000 0.000000
25% 0.000000 0.000000 0.000000 0.000000 9.000000 1.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 1.000000 35.587500 401.450000 0.000000
50% 1.000000 0.000000 0.000000 0.000000 29.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 0.000000 1.000000 2.000000 70.350000 1397.475000 0.000000
75% 1.000000 0.000000 1.000000 1.000000 55.000000 1.000000 2.000000 1.000000 2.000000 2.000000 2.000000 2.000000 2.000000 2.000000 1.000000 1.000000 2.000000 89.862500 3794.737500 1.000000
max 1.000000 1.000000 1.000000 1.000000 72.000000 1.000000 2.000000 2.000000 2.000000 2.000000 2.000000 2.000000 2.000000 2.000000 2.000000 1.000000 3.000000 118.750000 8684.800000 1.000000

Looks like all of the columns are now numerical and ready for further exploration and modelling.

Visualization

Let us further analyze data through visualizations.

### Function for KDE plots
def kdeplot_churn(col):
    ## Set size of figure 
    plt.figure(figsize=(20,7))

    ## KDE plots for each category label using Seaborn
    sns.kdeplot(data=df_clean[df_clean["Churn"]==0][col], 
                label="Churn - No", shade=True)
    sns.kdeplot(data=df_clean[df_clean["Churn"]==1][col], 
                label="Churn - Yes", shade=True)

    ## Label x ticks
    if not col in ["SeniorCitizen", "tenure", "MonthlyCharges", "TotalCharges"]:
        plt.xticks( np.arange(len(le[col].classes_)), (le[col].classes_) )
    
    ## Add title
    plt.title("DISTRIBUTION OF {} BY CHURN".format(col.upper()))

Shown below are the KDE plots and category plots of each predictor column versus the target variable Churn. From these plots, we can see how balanced/imbalanced the datas are, and which categories likely classify Churn.

### KDE plot to see distributions by churn
for col in df_clean.columns:
    if col != "Churn":
        kdeplot_churn(col)
        if not col in ["MonthlyCharges", "TotalCharges", "tenure"]:
            plt.figure(figsize=(20,7))
            sns.catplot(x=col, kind="count", hue="Churn", palette="ch:.25", data=df_clean)

png

<Figure size 1440x504 with 0 Axes>

png

png

<Figure size 1440x504 with 0 Axes>

png

png

<Figure size 1440x504 with 0 Axes>

png

png

<Figure size 1440x504 with 0 Axes>

png

png

png

<Figure size 1440x504 with 0 Axes>

png

png

<Figure size 1440x504 with 0 Axes>

png

png

<Figure size 1440x504 with 0 Axes>

png

png

<Figure size 1440x504 with 0 Axes>

png

png

<Figure size 1440x504 with 0 Axes>

png

png

<Figure size 1440x504 with 0 Axes>

png

png

<Figure size 1440x504 with 0 Axes>

png

png

<Figure size 1440x504 with 0 Axes>

png

png

<Figure size 1440x504 with 0 Axes>

png

png

<Figure size 1440x504 with 0 Axes>

png

png

<Figure size 1440x504 with 0 Axes>

png

png

<Figure size 1440x504 with 0 Axes>

png

png

png

Prepare the Data for Modelling

Now, let us prepare the data for modelling.

Train-Test Split

The dataset is divided into train set and test set.

The train set is used for building the model, while the test set is used for validating the model.

### Separate the predictors from the target variable
X = df_clean.drop(["Churn"], axis=1)
y = df_clean["Churn"]

print("Size of x (predictors):\t{}\nSize of y (target):\t{}".format(X.shape, y.shape))
Size of x (predictors):	(7032, 19)
Size of y (target):	(7032,)
### Split the dataset into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, stratify=y, random_state=1)

### Check shape to make sure it is all in order
print("Size of x_train: {} \t Size of x_test: {} \nSize of y_train: {} \t Size of y_test: {}".format(
    X_train.shape, X_test.shape, y_train.shape, y_test.shape))
Size of x_train: (4922, 19) 	 Size of x_test: (2110, 19) 
Size of y_train: (4922,) 	 Size of y_test: (2110,)
print(y_train.value_counts(), '\n', y_test.value_counts())
0    3614
1    1308
Name: Churn, dtype: int64 
 0    1549
1     561
Name: Churn, dtype: int64

Shown above is the distribution of Churn data, and we can see that it is imbalanced. The dataset is dominated by rows where Churn == 0, and we can make it more balanced using the resampling technique performed below.

Resample

### Concatenate the train data before resampling 
df_train = pd.concat([X_train, y_train], axis=1)
print("DF Train shape:", df_train.shape, "\nDF Train value counts:\n",df_train['Churn'].value_counts())
df_train.head()
DF Train shape: (4922, 20) 
DF Train value counts:
 0    3614
1    1308
Name: Churn, dtype: int64
gender SeniorCitizen Partner Dependents tenure PhoneService MultipleLines InternetService OnlineSecurity OnlineBackup DeviceProtection TechSupport StreamingTV StreamingMovies Contract PaperlessBilling PaymentMethod MonthlyCharges TotalCharges Churn
customerID
7105-MXJLL 0 1 1 0 26 1 0 0 0 0 2 0 0 2 1 0 3 60.70 1597.40 0
6754-WKSHP 1 0 0 1 30 1 2 2 1 1 1 1 1 1 1 0 0 25.35 723.30 0
1984-FCOWB 0 0 1 0 70 1 2 1 0 2 2 2 2 2 1 1 2 109.50 7674.55 1
5188-HGMLP 1 1 1 0 54 1 2 1 0 0 0 0 0 0 0 0 0 74.00 3919.15 0
0196-JTUQI 0 0 0 0 9 1 2 1 0 0 0 0 0 0 0 1 1 75.20 633.85 0
### Append the category with highest value counts
df_train_res = df_train[df_train["Churn"]==0]

### Resample minority class
resampled = resample(df_train[df_train["Churn"]==1],
                    replace=True, # sample with replacement
                    n_samples=3614, # match number in majority class
                    random_state=1) # reproducible results
df_train_res = pd.concat([df_train_res, resampled]) 

### Print resampled training set to check
print("Size of df_train_res:", df_train_res.shape, "\nValue counts for Churn:\n", 
      df_train_res["Churn"].value_counts())
df_train_res.head()
Size of df_train_res: (7228, 20) 
Value counts for Churn:
 1    3614
0    3614
Name: Churn, dtype: int64
gender SeniorCitizen Partner Dependents tenure PhoneService MultipleLines InternetService OnlineSecurity OnlineBackup DeviceProtection TechSupport StreamingTV StreamingMovies Contract PaperlessBilling PaymentMethod MonthlyCharges TotalCharges Churn
customerID
7105-MXJLL 0 1 1 0 26 1 0 0 0 0 2 0 0 2 1 0 3 60.70 1597.40 0
6754-WKSHP 1 0 0 1 30 1 2 2 1 1 1 1 1 1 1 0 0 25.35 723.30 0
5188-HGMLP 1 1 1 0 54 1 2 1 0 0 0 0 0 0 0 0 0 74.00 3919.15 0
0196-JTUQI 0 0 0 0 9 1 2 1 0 0 0 0 0 0 0 1 1 75.20 633.85 0
8755-IWJHN 1 1 1 0 69 1 2 1 0 2 2 0 0 2 0 1 0 95.35 6382.00 0
X_train_res = df_train_res.drop(["Churn"], axis=1)
y_train_res = df_train_res["Churn"]

print("Size of x_train_res: {}\nSize of y_train_res: {}".format(X_train_res.shape, y_train_res.shape))
Size of x_train_res: (7228, 19)
Size of y_train_res: (7228,)

Eventhough we have resampled data, we will still consider the original data and see which of those datasets enhances model performance.

Build the Models

Now that our data is prepared for modelling, let us initialize functions that we will use for it.

### Function for computing for the recall of Churn==0
def recall_0(correct, pred):
    return metrics.recall_score(y_true=correct, y_pred=pred, pos_label=0, average="binary")

### Function for getting and displaying score
def score_card(model, predicted):
    ## Print classification report
    print( "Classification report for {}:\n{}".format( model, 
                                                metrics.classification_report(y_test, predicted) ) )

    ## Cross-Validation
    scoring = {
        "recall_0": metrics.make_scorer(recall_0),
        "accuracy": "accuracy",
        "recall": "recall",
        "precision": "precision",
        "f1": "f1"
    }
    cv_scores = cross_validate(model, X, y, scoring=scoring, cv=10)
    
    ## Return scores
    return { c: np.mean(cv_scores[c]) for c in cv_scores }
### Initialized for easy plotting of confusion matrix
def confmatrix(y_pred, title):
    cm = metrics.confusion_matrix(y_test, y_pred)
    df_cm = pd.DataFrame(cm, columns=np.unique(y_test), index = np.unique(y_test))
    df_cm.index.name = 'Actual'
    df_cm.columns.name = 'Predicted'
    
    plt.figure(figsize = (10,7))
    plt.title(title)
    
    sns.set(font_scale=1.4) # For label size
    sns.heatmap(df_cm, cmap="Blues", annot=True,annot_kws={"size": 16}) # Font size

Pipeline

For quickly choosing the best model, we can construct pipelines for each of the machine algorithms:

  • Logistic Regression
  • K-Nearest Neighbors
  • Naive Bayes
  • Decision Trees
  • Random Forest Classifier
  • XGBoost

We'll also use two types datasets per algorithm:

  • Resampled Data
  • Original Data

The top 3 models with the highest accuracy scores are used for further modelling and tuning.

### Make pipelines
pipelines = {
    "LogReg": make_pipeline(StandardScaler(), LogisticRegression()),
    "KNN": make_pipeline(StandardScaler(), KNeighborsClassifier()),
    "GNB": make_pipeline(StandardScaler(), GaussianNB()),
    "DTree": make_pipeline(StandardScaler(), DecisionTreeClassifier()),
    "RF": make_pipeline(StandardScaler(), RandomForestClassifier()),
    "XGB": make_pipeline(StandardScaler(), GradientBoostingClassifier())
}
### Initialize dict for results
pipe_results = {}

### Iterate through pipelines to get each accuracy score
for pipe in pipelines:
    model_orig = pipelines[pipe]
    model_orig.fit(X_train, y_train)
    
    pipe_results[pipe] = [metrics.accuracy_score(y_test, model_orig.predict(X_test)) * 100]
    
    model_res = pipelines[pipe]
    model_res.fit(X_train_res, y_train_res)
    
    pipe_results[pipe].append(metrics.accuracy_score(y_test, model_res.predict(X_test)) * 100)

### Print accuracy scores got per pipeline
print("-------Accuracy Scores per Pipeline-------")
print("Algorithm\tOriginal\tResampled")
for pipe in pipe_results:
    print( "%s\t\t%.3f\t\t%.3f" % (pipe, pipe_results[pipe][0], pipe_results[pipe][1]) )
-------Accuracy Scores per Pipeline-------
Algorithm	Original	Resampled
LogReg		80.142		73.602
KNN		75.498		67.156
GNB		74.645		72.986
DTree		71.801		72.085
RF		78.199		77.346
XGB		80.332		74.882

From the results above, the top 3 models are:

  1. XGBoost Classifier
  2. Logistic Regression
  3. Random Forest Classifier

Below, we'll try to tune these models using GridSearchCV and find out which ones are the best for this use case. Note that not all of the parameters will be tuned due to limits in memory.

Random Forest Classifier

Build/Train the Model

Find Best Parameters
rf_parameters = {
    "randomforestclassifier__n_estimators" : [n for n in np.arange(100,800,200)], 
    "randomforestclassifier__criterion" : ["gini", "entropy"]
}

rf_grid = GridSearchCV(pipelines["RF"], param_grid = rf_parameters,
                           cv = 5, scoring="accuracy")

rf_grid.fit(X, y)
print("-----RF GridSearch-----")
print( "Best Params: {}\nBest Score: {}".format(rf_grid.best_params_, rf_grid.best_score_) )
-----RF GridSearch-----
Best Params: {'randomforestclassifier__criterion': 'entropy', 'randomforestclassifier__n_estimators': 100}
Best Score: 0.7945108077360638
Use best parameters
### Instantiate the algorithm
rf = RandomForestClassifier(criterion="entropy", n_estimators=100)

### Fit the model
rf.fit(X_train, y_train)
RandomForestClassifier(bootstrap=True, class_weight=None, criterion='entropy',
                       max_depth=None, max_features='auto', max_leaf_nodes=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, n_estimators=100,
                       n_jobs=None, oob_score=False, random_state=None,
                       verbose=0, warm_start=False)

Validate the Model

Classification Report
### Predict on test set
rf_pred = rf.predict(X_test)

### Get score report for model
rf_score = score_card(rf, rf_pred)
pd.DataFrame.from_dict({"Cross-Validation Scores":rf_score})
Classification report for RandomForestClassifier(bootstrap=True, class_weight=None, criterion='entropy',
                       max_depth=None, max_features='auto', max_leaf_nodes=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, n_estimators=100,
                       n_jobs=None, oob_score=False, random_state=None,
                       verbose=0, warm_start=False):
              precision    recall  f1-score   support

           0       0.83      0.90      0.86      1549
           1       0.64      0.49      0.55       561

    accuracy                           0.79      2110
   macro avg       0.73      0.69      0.71      2110
weighted avg       0.78      0.79      0.78      2110
Cross-Validation Scores
fit_time 8.169977
score_time 0.652181
test_accuracy 0.794935
test_f1 0.563180
test_precision 0.650402
test_recall 0.497059
test_recall_0 0.902765
Confusion Matrix
### Plot the confusion matrix
confmatrix(rf_pred, "Random Forest - Telco Customer Churn\nConfusion Matrix")

png

Logistic Regression

Build/Train the Model

Find Best Parameters
logreg_parameters = {
    "logisticregression__penalty" : ["l1", "l2"], 
    "logisticregression__C" : [n for n in np.arange(0.5,3,0.5)]
}

logreg_grid = GridSearchCV(pipelines["LogReg"], param_grid = logreg_parameters,
                           cv = 5, scoring="accuracy")

logreg_grid.fit(X, y)
print("-----LogReg GridSearch-----")
print( "Best Params: {}\nBest Score: {}".format(logreg_grid.best_params_, logreg_grid.best_score_) )
-----LogReg GridSearch-----
Best Params: {'logisticregression__C': 0.5, 'logisticregression__penalty': 'l1'}
Best Score: 0.801905574516496
Use best parameters
### Instantiate the algorithm
logreg = LogisticRegression(C=0.5, penalty="l1")

### Fit the model
logreg.fit(X_train, y_train)
LogisticRegression(C=0.5, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=100,
                   multi_class='warn', n_jobs=None, penalty='l1',
                   random_state=None, solver='warn', tol=0.0001, verbose=0,
                   warm_start=False)

Validate the Model

Classification Report
### Predict on test set
logreg_pred = logreg.predict(X_test)

### Get score report for model
logreg_score = score_card(logreg, logreg_pred)
pd.DataFrame.from_dict({"Cross-Validation Scores":logreg_score})
Classification report for LogisticRegression(C=0.5, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=100,
                   multi_class='warn', n_jobs=None, penalty='l1',
                   random_state=None, solver='warn', tol=0.0001, verbose=0,
                   warm_start=False):
              precision    recall  f1-score   support

           0       0.85      0.89      0.87      1549
           1       0.65      0.56      0.60       561

    accuracy                           0.80      2110
   macro avg       0.75      0.73      0.73      2110
weighted avg       0.79      0.80      0.80      2110
Cross-Validation Scores
fit_time 0.696976
score_time 0.076322
test_accuracy 0.801480
test_f1 0.592991
test_precision 0.650664
test_recall 0.545219
test_recall_0 0.894247
Confusion Matrix
### Plot the confusion matrix
confmatrix(logreg_pred, "LogReg - Telco Customer Churn\nConfusion Matrix")

png

XGBoost

Build/Train the Model

Find Best Parameters
xgb_parameters = {
    "gradientboostingclassifier__loss" : ["deviance", "exponential"], 
    "gradientboostingclassifier__n_estimators" : [n for n in np.arange(100,800,200)]
}

xgb_grid = GridSearchCV(pipelines["XGB"], param_grid = xgb_parameters,
                           cv = 5, scoring="accuracy")

xgb_grid.fit(X, y)
print("-----XGB GridSearch-----")
print( "Best Params: {}\nBest Score: {}".format(xgb_grid.best_params_,xgb_grid.best_score_) )
-----XGB GridSearch-----
Best Params: {'gradientboostingclassifier__loss': 'deviance', 'gradientboostingclassifier__n_estimators': 100}
Best Score: 0.8056029579067122
Use best parameters
### Instantiate the algorithm
xgb = GradientBoostingClassifier(loss="deviance", n_estimators=100)

### Fit the model
xgb.fit(X_train, y_train)
GradientBoostingClassifier(criterion='friedman_mse', init=None,
                           learning_rate=0.1, loss='deviance', max_depth=3,
                           max_features=None, max_leaf_nodes=None,
                           min_impurity_decrease=0.0, min_impurity_split=None,
                           min_samples_leaf=1, min_samples_split=2,
                           min_weight_fraction_leaf=0.0, n_estimators=100,
                           n_iter_no_change=None, presort='auto',
                           random_state=None, subsample=1.0, tol=0.0001,
                           validation_fraction=0.1, verbose=0,
                           warm_start=False)

Validate the Model

Classification Report
### Predict on test set
xgb_pred = xgb.predict(X_test)

### Get score report for model
xgb_score = score_card(xgb, xgb_pred)
pd.DataFrame.from_dict({"Cross-Validation Scores":xgb_score})
Classification report for GradientBoostingClassifier(criterion='friedman_mse', init=None,
                           learning_rate=0.1, loss='deviance', max_depth=3,
                           max_features=None, max_leaf_nodes=None,
                           min_impurity_decrease=0.0, min_impurity_split=None,
                           min_samples_leaf=1, min_samples_split=2,
                           min_weight_fraction_leaf=0.0, n_estimators=100,
                           n_iter_no_change=None, presort='auto',
                           random_state=None, subsample=1.0, tol=0.0001,
                           validation_fraction=0.1, verbose=0,
                           warm_start=False):
              precision    recall  f1-score   support

           0       0.84      0.91      0.87      1549
           1       0.67      0.51      0.58       561

    accuracy                           0.80      2110
   macro avg       0.75      0.71      0.73      2110
weighted avg       0.79      0.80      0.79      2110
Cross-Validation Scores
fit_time 6.034562
score_time 0.133802
test_accuracy 0.803471
test_f1 0.583099
test_precision 0.669511
test_recall 0.517388
test_recall_0 0.907029
Confusion Matrix
### Plot the confusion matrix
confmatrix(xgb_pred, "XGBoost - Telco Customer Churn\nConfusion Matrix")

png

Summary of the Results

### Compile results into dataframe
df_results = pd.DataFrame.from_dict({
    "Logistic Regression": logreg_score, "XGBoost": xgb_score,
    "Random Forest": rf_score}, orient="index")

### Convert scores into percentages
for m in ["test_recall_0", "test_accuracy", "test_recall", "test_precision", "test_f1"]:
    df_results[m] = df_results[m] * 100

### Show resulting dataframe
df_results
fit_time score_time test_recall_0 test_accuracy test_recall test_precision test_f1
Logistic Regression 0.696976 0.076322 89.424715 80.148018 54.521879 65.066414 59.299086
Random Forest 8.169977 0.652181 90.276528 79.493538 49.705882 65.040211 56.317993
XGBoost 6.034562 0.133802 90.702922 80.347064 51.738830 66.951062 58.309915
### Find out the best model based on all metrics
print("-----Best Model per Metric-----")

## Initiate dict of best model pre metric
best = {"test_recall_0":0, "test_accuracy":0, "test_recall":0, "test_precision":0, "test_f1":0}

## Iterate through the dict and columns of df_results to find the max value and index
for m in best:
    best[m] = { "Model":df_results[m].idxmax(), "Score":df_results[m].max() }

## Display the results
df_best = pd.DataFrame.from_dict(best, orient="index")
print(df_best["Model"].value_counts()[:1])
df_best
-----Best Model per Metric-----
XGBoost    3
Name: Model, dtype: int64
Model Score
test_accuracy XGBoost 80.347064
test_f1 Logistic Regression 59.299086
test_precision XGBoost 66.951062
test_recall Logistic Regression 54.521879
test_recall_0 XGBoost 90.702922

In this exercise, we used LabelEncoder to give categories numerical labels

We tried categorical visualizations from Seaborn which are:

  • catplot - like a bar plot but for categories
  • kdeplot - like a smoothened histogram

We also explored SKLearn's make_pipeline which is very useful in saving time and typing when modelling.

Aside from that, we tried GridSearchCV to find the best parameters to use per model based on a dictionary of lists we defined.

For this data, we would want high recall to better detecting customers who are in risk of churning. Logistic Regression may have given the best score for recall but that score is still low.

So in this case, XGBoost is still the best model because eventhough it has less recall, XGBoost gave the best performance for most of the metrics. It has high accuracy, precision and recall(for 0).

Special Thanks

Releases

No releases published

Packages

No packages published