December 12, 2021

Exploratory Data Analysis
Telco Customer Churn

Have you ever been disappointed with a telecommunications company and finally decided to move to another one? Perhaps the price was too high, the signal was poor, or the service was bad. This is called customer churn. In this article, Arie will provide details about EDA and try to visualize the data obtained.

Data Understanding

Target :

  • Churn — Whether the customer churned or not (Yes or No)

Numeric Feature :

  • Tenure — Number of months the customer has been with the company.
  • MonthlyCharges — Monthly amount charged to customers
  • TotalCharges — Total amount charged to customers.

Categorical Feature :

  • CustomerID — The identity of the customer.
  • Gender — Male/Female.
  • SeniorCitizen — Whether the customer is a senior citizen or not (1, 0).
  • Partner — Whether the customer has a partner or not (Yes, No).
  • Dependents — Whether the customer has dependents or not (Yes, No).
  • PhoneService — Whether the customer has telephone service or not (Yes, No).
  • InternetService — Type of customer internet service (DSL, Fiber Optic, None).
  • OnlineSecurity — Does the customer have an Online Security add-on (Yes, No, No Internet Service).
  • OnlineBackup — Does the customer have an Online Backup add-on (Yes, No, No Internet Service).
  • DeviceProtection — Does the customer have a Device Protection add-on (Yes, No, No Internet Service).
  • TechSupport — Does the customer have a Technical Support add-on (Yes, No, No Internet Service).
  • StreamingTV — Whether the customer has streaming TV or not (Yes, No, No Internet Service).
  • StreamingMovies — Whether the customer has streaming movies or not (Yes, No, No Internet Service).
  • Contract — Term of the customer's contract (Monthly, 1-Year, 2-Year).
  • PaperlessBilling — Whether the customer has a paperless bill or not (Yes, No).
  • PaymentMethod — Customer payment methods (E-Check, Mailed Check, Bank Transfer (Auto), Credit Card (Auto)).

Import Package and Data

    We start by importing the basic libraries needed throughout the case: Pandas and NumPy for data handling and processing, and Matplotlib and Seaborn for visualization.

    import pandas as pd
    import numpy as np
    import re
    
    import matplotlib.pyplot as plt
    import seaborn as sns
    %matplotlib inline
    sns.set()
    

    For this exercise, the data set (.csv format) is downloaded to a local folder, read into the Jupyter notebook and stored in a Pandas DataFrame.

    df = pd.read_csv('Telco Customer Churn.csv')
    # overview of the dataset
    df.head()
    

    Initial EDA

    In the first part of the EDA, the data frame is evaluated for structure, included columns, and data types to get a general understanding of the data set. Get a summary of the data frame, including data types, shape, and memory usage.

    Get statistical information on numerical features.
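
    The two summaries described above can be sketched as follows. This is a minimal illustration on a few hypothetical toy rows, not the real Telco file; the column values are made up, and only standard pandas calls (`info`, `shape`, `describe`) are used.

```python
import pandas as pd

# Hypothetical toy rows standing in for the Telco dataset (not real data).
df = pd.DataFrame({
    "tenure": [1, 34, 2, 45, 8],
    "MonthlyCharges": [29.85, 56.95, 53.85, 42.30, 99.65],
    "TotalCharges": [29.85, 1889.50, 108.15, 1840.75, 820.50],
    "Contract": ["Month-to-month", "One year", "Month-to-month",
                 "One year", "Month-to-month"],
    "Churn": ["No", "No", "Yes", "No", "Yes"],
})

# Structure: column names, non-null counts, dtypes and memory usage.
df.info()

# Shape as (rows, columns).
print(df.shape)

# Summary statistics (count, mean, std, min, quartiles, max).
# By default only the numeric columns are included.
summary = df.describe()
print(summary)
```

    On the real data, the same calls would be applied to the `df` loaded with `pd.read_csv`.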

    Insight (Data Visualization + Summary)

    Display Frequency Distribution For Churn

    The frequency distribution shows that most customers do not churn: 27% of customers churned and 73% did not. Since Churn is the target variable, it is used as the grouping variable in most of the EDA.
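
    A frequency table like the one behind this plot could be computed roughly as below. The sketch uses a hypothetical toy `Churn` column, so the percentages it prints are not the 27%/73% of the real data.

```python
import pandas as pd

# Hypothetical toy target column (not the real data).
churn = pd.Series(["No", "No", "Yes", "No", "Yes", "No", "No", "No"],
                  name="Churn")

# Absolute counts per class.
counts = churn.value_counts()

# Relative frequencies, expressed in percent.
shares = churn.value_counts(normalize=True).mul(100).round(1)

# Both views side by side, aligned on the class label.
freq = pd.DataFrame({"count": counts, "percent": shares})
print(freq)
```

    With Seaborn available, `sns.countplot(x="Churn", data=df)` would be one way to draw the same distribution.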

    Exploratory Data Analysis (EDA) Numeric Variable

    Insight:

  • Churned customers have a significantly lower tenure, with a median of 10 months compared to a median of 38 months for non-churned customers.
  • Churned customers have higher monthly charges, with a median of about 80 USD and a much narrower interquartile range than non-churned customers (median around 65 USD).
  • In MonthlyCharges, the higher the monthly fee, the higher the tendency to churn. Based on this, service providers could set prices at around 20 to 60 USD to reduce churn.
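
    The medians and interquartile ranges quoted in these insights are the kind of numbers a grouped summary produces. The following is a minimal sketch on hypothetical toy values; the helper name `iqr` is introduced here for illustration only.

```python
import pandas as pd

# Hypothetical toy rows (not the real data).
df = pd.DataFrame({
    "tenure": [1, 5, 10, 30, 40, 60],
    "MonthlyCharges": [85.0, 75.0, 80.0, 60.0, 70.0, 20.0],
    "Churn": ["Yes", "Yes", "Yes", "No", "No", "No"],
})

def iqr(s):
    """Interquartile range: 75th percentile minus 25th percentile."""
    return s.quantile(0.75) - s.quantile(0.25)

# Median and IQR of each numeric feature, split by churn status.
stats = df.groupby("Churn")[["tenure", "MonthlyCharges"]].agg(["median", iqr])
print(stats)
```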

    Exploratory Data Analysis (EDA) Categorical Variable

    Insight:

  • The churn rate of senior citizens is much higher than that of non-senior customers.
  • The churn rate for month-to-month contracts is much higher than for other contract terms.
  • Customers without partners have a higher churn rate.
  • Customers without dependents have a much higher churn rate.
  • Customers paying by electronic check show a much higher churn rate than those using other payment methods.
  • Customers with fiber optic InternetService have a much higher churn rate.
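
    Churn rates per category, as compared in the insights above, can be computed with a grouped mean. This is a sketch on hypothetical toy rows; the helper `churn_rate_by` is a name made up for this example.

```python
import pandas as pd

# Hypothetical toy rows (not the real data).
df = pd.DataFrame({
    "Contract": ["Month-to-month", "Month-to-month", "Month-to-month",
                 "One year", "One year", "Two year"],
    "SeniorCitizen": [1, 0, 1, 0, 0, 0],
    "Churn": ["Yes", "Yes", "No", "No", "Yes", "No"],
})

def churn_rate_by(frame, column):
    """Share of churned customers (in percent) per category of `column`."""
    churned = frame["Churn"].eq("Yes")
    return churned.groupby(frame[column]).mean().mul(100).round(1)

for col in ["Contract", "SeniorCitizen"]:
    print(churn_rate_by(df, col), end="\n\n")
```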

    Feature Engineering Action

    Check Outlier in Numerical Feature

    Check for outliers by applying the IQR method, which flags values that lie far outside the IQR borders, or by visualizing the data with a boxplot.

    There are no outliers in the numerical features when checked with the boxplot and the IQR method, so no action is needed.
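
    One common form of the IQR check is sketched below. The multiplier `k=1.5` is an assumption (it is the usual boxplot whisker convention, and the article does not state which factor was used), and the data are hypothetical toy values.

```python
import pandas as pd

def iqr_outliers(series, k=1.5):
    """Boolean mask of values outside [Q1 - k*IQR, Q3 + k*IQR].

    k=1.5 is the usual boxplot convention, assumed here.
    """
    q1, q3 = series.quantile([0.25, 0.75])
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return (series < lower) | (series > upper)

# Hypothetical toy values (not the real data).
df = pd.DataFrame({
    "tenure": [1, 5, 12, 24, 36, 48, 60, 72],
    "MonthlyCharges": [20.0, 35.5, 50.0, 65.0, 70.5, 80.0, 95.0, 110.0],
})

# Count flagged values per numeric column.
for col in df.columns:
    print(col, "outliers:", int(iqr_outliers(df[col]).sum()))
```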

    The next step in data cleaning is to drop the rows with missing values. Based on the data types and the values, the following actions are defined to preprocess/engineer the features for machine readability and further analysis:

    Column Removed

  • CustomerID: not relevant
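
    The two cleaning steps mentioned so far, dropping rows with missing values and removing the identifier column, might look like this. The rows and ID values are hypothetical, and the sketch assumes missing values are already represented as NaN.

```python
import numpy as np
import pandas as pd

# Hypothetical toy rows (not the real data); one row has a missing value.
df = pd.DataFrame({
    "CustomerID": ["A-001", "B-002", "C-003"],
    "tenure": [0, 12, 24],
    "TotalCharges": [np.nan, 650.4, 1500.0],
})

# Number of missing values per column.
print(df.isna().sum())

clean = (
    df.dropna()                      # drop rows containing any NaN
      .drop(columns=["CustomerID"])  # identifier carries no predictive signal
      .reset_index(drop=True)
)
print(clean)
```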

    No Action Needed

  • SeniorCitizen

    Label Encoding

    The following features are categorical and each takes on 2 values (mostly yes/no), so they are transformed to binary integers:

  • Gender
  • Partner
  • Dependents
  • Churn
  • PhoneService
  • PaperlessBilling
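
    A minimal sketch of this binary encoding, using explicit mappings on hypothetical toy rows. Which value becomes 1 for Gender is an arbitrary choice made for this example, not something the article specifies.

```python
import pandas as pd

# Hypothetical toy rows (not the real data).
df = pd.DataFrame({
    "Gender": ["Female", "Male", "Male"],
    "Partner": ["Yes", "No", "No"],
    "Churn": ["No", "No", "Yes"],
})

# Explicit mappings keep the meaning of 0 and 1 visible.
yes_no = {"No": 0, "Yes": 1}
gender = {"Female": 0, "Male": 1}  # arbitrary assignment for this sketch

encoded = df.copy()
encoded["Gender"] = encoded["Gender"].map(gender)
for col in ["Partner", "Churn"]:
    encoded[col] = encoded[col].map(yes_no)

print(encoded)
```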

    One-Hot Encoding

    The following features are categorical but not ordinal (no ranking) and take on more than 2 values. For each value, a new variable is created with a binary integer indicating whether the value occurred in a data entry (1) or not (0).

  • MultipleLines
  • InternetService
  • OnlineSecurity
  • OnlineBackup
  • DeviceProtection
  • TechSupport
  • StreamingTV
  • StreamingMovies
  • Contract
  • PaymentMethod
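
    One-hot encoding of such columns can be sketched with `pd.get_dummies`. The rows below are hypothetical, and only two of the listed columns are shown.

```python
import pandas as pd

# Hypothetical toy rows (not the real data).
df = pd.DataFrame({
    "Contract": ["Month-to-month", "One year", "Two year"],
    "InternetService": ["DSL", "Fiber optic", "No"],
})

# One new 0/1 column per distinct value, named "<column>_<value>".
dummies = pd.get_dummies(df, columns=["Contract", "InternetService"], dtype=int)
print(dummies.columns.tolist())
print(dummies)
```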

    Min-Max Scaling

    Values of numerical features are rescaled to the range 0 to 1. Min-max scaling is a common default approach. For normally distributed features a standard scaler could be used instead, which rescales values to a mean of 0 and a standard deviation of 1. For simplicity, min-max scaling is used for all numerical features:

  • tenure
  • TotalCharges
  • MonthlyCharges
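
    Min-max scaling maps each value x to (x - min) / (max - min). The sketch below does this directly in pandas on hypothetical toy values; it is an illustration of the formula, not necessarily the tool used in the original notebook.

```python
import pandas as pd

# Hypothetical toy values (not the real data).
df = pd.DataFrame({
    "tenure": [1, 12, 72],
    "MonthlyCharges": [20.0, 70.0, 120.0],
})

numeric_cols = ["tenure", "MonthlyCharges"]

# Column-wise minimum and maximum.
col_min = df[numeric_cols].min()
col_max = df[numeric_cols].max()

# x' = (x - min) / (max - min), applied per column.
scaled = (df[numeric_cols] - col_min) / (col_max - col_min)
print(scaled)
```

    scikit-learn's `MinMaxScaler` applies the same transformation and additionally remembers the fitted minimum and maximum for reuse on new data.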

    Generate a new feature, "Number_AdditionalServices", by summing up the number of add-on services each customer uses. Then generate a countplot for the new feature.
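
    The new feature could be built roughly as follows (spelled `Number_AdditionalServices` here). Which columns count as additional services is an assumption of this sketch: the six internet add-on and streaming columns are used. The rows are hypothetical.

```python
import pandas as pd

# Assumed set of add-on service columns (an assumption of this sketch).
addon_cols = ["OnlineSecurity", "OnlineBackup", "DeviceProtection",
              "TechSupport", "StreamingTV", "StreamingMovies"]

# Hypothetical toy rows (not the real data).
df = pd.DataFrame(
    [
        ["Yes", "No", "No", "No", "No", "No", "Yes"],
        ["Yes", "Yes", "Yes", "Yes", "Yes", "No", "No"],
        ["No internet service"] * 6 + ["No"],
    ],
    columns=addon_cols + ["Churn"],
)

# Count how many add-on columns equal "Yes" in each row.
df["Number_AdditionalServices"] = df[addon_cols].eq("Yes").sum(axis=1)

# Tabular equivalent of a countplot split by churn.
table = pd.crosstab(df["Number_AdditionalServices"], df["Churn"])
print(table)
```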

    Insight:

  • The countplot shows a very high churn rate for customers with 1 additional service.
  • Customers with many additional services have a low churn rate.

    Correlation Analysis

    Insight:

  • The month-to-month contract is the biggest driver of churn.
  • High tenure ranks as the strongest factor for non-churn and the strongest feature overall. This is also supported by the boxplot in the EDA step.
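
    A ranking like the one behind these insights can be obtained by correlating each encoded feature with the target. The sketch assumes Pearson correlation (the pandas default; the article does not name the method) and uses hypothetical, already-encoded toy rows.

```python
import pandas as pd

# Hypothetical, already-encoded toy rows (not the real data).
df = pd.DataFrame({
    "tenure": [1, 2, 5, 30, 48, 60, 70, 3],
    "MonthlyCharges": [80.0, 95.0, 70.0, 60.0, 55.0, 25.0, 90.0, 85.0],
    "Contract_Month-to-month": [1, 1, 1, 0, 0, 0, 0, 1],
    "Churn": [1, 1, 0, 0, 0, 0, 0, 1],
})

# Pearson correlation of every feature with Churn, sorted from the
# most negative (associated with staying) to the most positive.
corr_with_churn = df.corr()["Churn"].drop("Churn").sort_values()
print(corr_with_churn)
```

    Negative values indicate features associated with staying, positive values features associated with churning.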