Telecommunication Company Customer Churn Prediction to Minimize Company Costs

Ridho Aryo Pratama
10 min read · Sep 18, 2023

Photo by Mario Caruso on Unsplash

Introduction

In a telco company, there are two costs known as Acquisition Cost and Retention Cost. Acquisition Cost is the expense for a company to acquire new customers. Meanwhile, Retention Cost is the spending for the company to retain existing customers.

Because human judgment is limited, we often mispredict which customers will churn and which will stay. As a result, funds are misallocated and total spending ends up larger than necessary.

Furthermore, according to various sources, the acquisition cost is seven times higher than the retention cost. If we make a mistake in predicting that the customer will continue using our product, but in reality, they churn, we end up spending more than necessary.

Objectives

In this project, our goal is to create a smart computer program using Machine Learning that can guess which customers might decide to stop using our services and which ones will continue using them. The idea is to make this guessing game as accurate as possible. Why? So we can be really smart about where we spend our money on trying to keep customers and where we don’t need to spend as much.

Objective 1

Identify the factors influencing customer churn.

Objective 2

Create a machine learning model capable of predicting churn.

Objective 3

Minimize costs.

GitHub Repo

Visit this link to get more details about the code and presentation: https://github.com/ridhoaryo/TelcoCustomerChurn

Assumption

We will attempt to make initial assumptions about retention cost and acquisition cost. For this case, we will consider the retention cost to be $10, while the acquisition cost is $70 (7 times larger).

Retention Cost

Retention cost, also known as Customer Retention Cost, refers to the expenses a company incurs to retain its existing customers and prevent them from leaving or churning.

Acquisition Cost

Acquisition cost, often referred to as Customer Acquisition Cost (CAC), is the expense incurred by a company to acquire new customers. It represents the cost associated with convincing a potential customer to make their first purchase or start using a company’s products or services.

The Dataset

Read the dataset
The Dataset

The dataset contains 7,043 observations and 33 variables. It describes a fictional telco company that provided home phone and Internet services to 7,043 customers in California in Q3. Visit this link to go to the dataset.

Data Understanding and Preparation

Report Function
Some columns are still not visible.

From the dataframe above, several columns have only one unique value, namely [`Count`, `Country`, `State`]. Additionally, I will not use the `CustomerID` column because it is only an identifier and does not determine the probability of someone churning.

`Zip code`, `Lat Long`, `Latitude`, and `Longitude` will also be deleted. I will not use them to build the Machine Learning model.

I will also remove `Churn Score` and `Churn Reason`, since a customer’s Churn Score and Churn Reason cannot be obtained before they actually churn or leave. Keeping them would therefore leak information about the target when we build the model.

Rename and drop some unnecessary columns
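The dropping step described above can be sketched as follows. This is a minimal illustration on a tiny made-up frame, not the code from the repo: the column spellings follow the text above, while `Tenure Months` and `Churn Value` as the surviving columns are assumptions for the example.

```python
import pandas as pd

# Tiny stand-in frame (made-up values); the real dataset has 7,043 rows.
df = pd.DataFrame({
    "CustomerID": ["0001-A", "0002-B"],
    "Count": [1, 1],
    "Country": ["United States", "United States"],
    "State": ["California", "California"],
    "Zip code": [90001, 90002],
    "Lat Long": ["33.97, -118.24", "33.94, -118.25"],
    "Latitude": [33.97, 33.94],
    "Longitude": [-118.24, -118.25],
    "Tenure Months": [2, 40],
    "Churn Value": [1, 0],          # assumed name of the target column
    "Churn Score": [86, 20],
    "Churn Reason": ["Moved", None],
})

# Columns with a single unique value carry no signal.
constant_cols = [c for c in df.columns if df[c].nunique(dropna=False) == 1]
identifier_cols = ["CustomerID"]
location_cols = ["Zip code", "Lat Long", "Latitude", "Longitude"]
leakage_cols = ["Churn Score", "Churn Reason"]

# dict.fromkeys removes duplicates while keeping the order.
to_drop = list(dict.fromkeys(
    constant_cols + identifier_cols + location_cols + leakage_cols
))
clean = df.drop(columns=to_drop)
print(list(clean.columns))
```

Detecting the constant columns with `nunique` instead of hard-coding them keeps the step valid if the data is refreshed later.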

Data Analysis

Churn Proportion

Make a Pie Chart

From the pie chart above, we can see that 26.54% of the customers in this dataset are labelled as churned. This indicates an imbalance between the number of customers leaving and those staying. While some might think that this dataset needs to be resampled or, in other words, that the labels should be “balanced,” I am hesitant to take that step at this point. Instead, I will conduct further exploration to gain more insights.

Churn by Categorical Features

Before we delve further into the realm of machine learning modelling, we will try to examine the impact of each customer’s demographic and behavioral feature on their churn status.

Function to make visualisation and analysis for categorical feature vs target
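As a rough idea of what such a function computes, here is a minimal sketch that covers only the numeric part (the churn rate per category), without the plot. The function name, the `Churn` key, and the eight sample rows are all made up for illustration.

```python
from collections import defaultdict


def churn_rate_by_category(rows, feature, target="Churn"):
    """Return {category: churn rate} for one categorical feature."""
    totals = defaultdict(int)
    churned = defaultdict(int)
    for row in rows:
        totals[row[feature]] += 1
        churned[row[feature]] += row[target]  # target is 1 for churn, 0 otherwise
    return {cat: churned[cat] / totals[cat] for cat in totals}


# Made-up sample rows, only to show the mechanics.
rows = [
    {"Contract": "Month-to-month", "Churn": 1},
    {"Contract": "Month-to-month", "Churn": 1},
    {"Contract": "Month-to-month", "Churn": 0},
    {"Contract": "Month-to-month", "Churn": 0},
    {"Contract": "Two year", "Churn": 0},
    {"Contract": "Two year", "Churn": 0},
    {"Contract": "Two year", "Churn": 0},
    {"Contract": "Two year", "Churn": 1},
]

rates = churn_rate_by_category(rows, "Contract")
ratio = rates["Month-to-month"] / rates["Two year"]
print(rates, ratio)
```

Statements such as "X times more likely to churn" in the sections below are ratios of this kind between two categories.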

Churn by Gender

Churn by Gender

It can be observed that the probability of churn based on gender does not differ significantly between men and women.

Churn by Senior Citizen Status

Churn by Seniors

The elderly have nearly twice the chance of churning compared to the younger generation.

Churn by Partner

Churn by Partner

Customers without a partner also have a tendency to churn almost twice as much as those who have a partner.

Churn by Dependents

Churn by Dependents

Customers who have children (or other dependents) have a 5 times greater chance of churning compared to customers who do not have them.

Churn by Internet Service Subscription

Churn by Internet Service Subscription

Customers who use a fiber optic connection for their internet service subscription have nearly 6 times greater chances of churning compared to those who do not subscribe to internet services.

Churn by Device Protection

Churn by Device Protection

Customers who do not subscribe to device protection are more than 5 times as likely to churn as those who do.

Churn by Contract

Churn by Contract

Customers with month-to-month contracts are 15 times more likely to churn than customers with 2-year or annual contracts.

Churn by Payment Method

Churn by Payment Method

Customers who use electronic checks as their payment method are approximately 3 times more likely to churn compared to other payment methods.

Churn by Numerical Features

Now, we shift our focus to numerical features. We will try to examine the relationship between these numerical features and churn status.

Tenure Months Distribution by Churn

Tenure Months Distribution by Churn

Here we look at the distribution of tenure months for churned and retained customers. We also use the Wilcoxon test to assess the relationship between the feature and the target. The ‘tenure months’ column does not have a strong enough relationship with churn.
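For readers who want to try this kind of test, here is a minimal sketch. The text does not say which Wilcoxon variant was used; for two independent groups (churned vs. retained) the rank-sum variant applies, which is equivalent to the Mann-Whitney U test, so that is what is assumed here. The tenure values are made up.

```python
from scipy.stats import mannwhitneyu

# Made-up tenure values (months) for two groups of customers.
tenure_churned = [1, 2, 3, 4, 5, 6, 7, 8]
tenure_retained = [20, 24, 30, 36, 41, 48, 60, 72]

# Two-sided test: do the two groups come from the same distribution?
result = mannwhitneyu(tenure_churned, tenure_retained, alternative="two-sided")
print(result.statistic, result.pvalue)
```

A small p-value only says the two distributions differ; how strong the relationship is still has to be judged from the distributions themselves.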

Total Charges Distribution by Churn

Turns out, the ‘total charges’ column also does not have a strong enough relationship with churn.

CLTV Distribution by Churn

Turns out, the ‘CLTV’ column also does not have a strong enough relationship with churn.

Checking Multicollinearity

In addition to examining the relationship between Exogenous Variables and the target, we will also look at the potential multicollinearity among numerical variables. We will use correlation and VIF.

VIF, or Variance Inflation Factor, is a statistic used to measure how much the variance of an estimated regression coefficient increases when your predictors (independent variables) are correlated. In simpler terms, it helps us understand if there is a problem of multicollinearity in our data, which occurs when two or more independent variables in a regression model are highly correlated with each other. High VIF values (usually above 10) indicate a high degree of multicollinearity, which can make it difficult to interpret the effects of individual predictors in a regression model. It’s important to keep VIF values low to ensure the reliability of your regression analysis.
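To make the definition concrete, here is a minimal sketch that computes VIF by hand: each predictor is regressed on the others, and VIF = 1 / (1 - R²). The data is synthetic and the helper name `vif` is an assumption for this example, not the function used in the repo.

```python
import numpy as np


def vif(X):
    """Variance inflation factor for each column of a 2-D array."""
    X = np.asarray(X, dtype=float)
    factors = []
    for j in range(X.shape[1]):
        y = X[:, j]
        others = np.delete(X, j, axis=1)
        # Regress column j on the remaining columns plus an intercept.
        design = np.column_stack([np.ones(len(y)), others])
        coef, *_ = np.linalg.lstsq(design, y, rcond=None)
        residuals = y - design @ coef
        r_squared = 1.0 - residuals.var() / y.var()
        factors.append(1.0 / (1.0 - r_squared))
    return np.array(factors)


# Synthetic data: total charges are built from tenure and monthly charges,
# while the fourth column is independent of the other three.
rng = np.random.default_rng(0)
tenure = rng.integers(1, 73, size=500).astype(float)
monthly = rng.uniform(20, 120, size=500)
total = tenure * monthly + rng.normal(0, 50, size=500)
independent = rng.uniform(2000, 6500, size=500)

factors = vif(np.column_stack([tenure, monthly, total, independent]))
print(factors.round(2))
```

In this synthetic setup the column built from the other two gets a much larger VIF than the independent one, which is the pattern to look for in real data.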

Check correlation and VIF
Correlation
VIF

Surprisingly, all features have high VIF values. The columns ‘Tenure Months,’ ‘Monthly Charges,’ and ‘Total Charges’ are interconnected. We can choose one of them. Here, we will use the ‘Total Charges’ column.

VIF after Deleting Tenure Months and Monthly Charges

Modelling Strategies

For the modelling stage, this churn prediction will focus on correctly identifying as many of the customers who will churn as possible. If the model predicts ‘retain’ for a customer who will actually ‘churn’, we lose that customer. According to some opinions, the cost of customer acquisition is 7 times greater than the cost of retaining a customer.

For that reason, the modelling will focus on the Recall Score. We will use the PR (Precision-Recall) Curve to find the optimum threshold.

The strategy is, first, to divide the data into two parts: Training Data (80%) and Testing Data (20%). From the Training Data, we will further split it into two parts: Training Data and Validation Data. The Validation Data will be used to assess the model’s predictive abilities.

Train Test Split

After this split, the training set has a shape of 5634 x 19 and the testing set has a shape of 1409 x 19.
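The shapes above can be reproduced with a minimal sketch on placeholder arrays. Stratifying on the target, the random seed, and the 20% validation share of the second split are assumptions; the text only states the 80/20 train/test proportion.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder data with the same dimensions as in the text.
n_rows, n_features = 7043, 19
X = np.zeros((n_rows, n_features))
y = np.zeros(n_rows, dtype=int)
y[:1869] = 1  # about 26.54% of 7,043 rows labelled as churn

# First split: 80% training, 20% testing.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# Second split: carve a validation set out of the training data.
X_fit, X_val, y_fit, y_val = train_test_split(
    X_train, y_train, test_size=0.2, stratify=y_train, random_state=42
)

print(X_train.shape, X_test.shape, X_fit.shape, X_val.shape)
```

Stratifying keeps the churn proportion roughly the same in every part, which matters when only about a quarter of the labels are positive.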

Initiating Machine Learning Pipeline

The second step involves creating a Pipeline that allows us to process the data up to the model fitting stage. Why use a Pipeline? This is to avoid data leakage.

Data leakage occurs when information from the testing dataset “leaks” into the training process. In other words, the model unintentionally gains access to information it shouldn’t have during training, which can lead to unrealistically optimistic performance estimates.

Machine Learning Pipeline to Avoid Data Leakage
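A minimal sketch of such a pipeline with scikit-learn is shown here. The choice of encoder, scaler, and logistic regression is an assumption made for illustration (the text does not name the model), and the column names and rows are made up.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

categorical = ["Contract", "Internet Service"]
numerical = ["Total Charges", "CLTV"]

# Preprocessing is part of the pipeline, so its statistics are learned
# only from the data passed to fit().
preprocess = ColumnTransformer([
    ("cat", OneHotEncoder(handle_unknown="ignore"), categorical),
    ("num", StandardScaler(), numerical),
])
model = Pipeline([
    ("preprocess", preprocess),
    ("clf", LogisticRegression(class_weight="balanced", max_iter=1000)),
])

# Made-up training and testing rows.
train = pd.DataFrame({
    "Contract": ["Month-to-month", "Two year", "Month-to-month",
                 "One year", "Two year", "Month-to-month"],
    "Internet Service": ["Fiber optic", "DSL", "Fiber optic",
                         "No", "DSL", "Fiber optic"],
    "Total Charges": [100.0, 4000.0, 250.0, 1800.0, 5200.0, 90.0],
    "CLTV": [3000, 5500, 2800, 4200, 6000, 2500],
    "Churn": [1, 0, 1, 0, 0, 1],
})
test = pd.DataFrame({
    "Contract": ["Month-to-month"],
    "Internet Service": ["Cable"],  # category never seen during training
    "Total Charges": [120.0],
    "CLTV": [2900],
})

model.fit(train[categorical + numerical], train["Churn"])
proba = model.predict_proba(test[categorical + numerical])[:, 1]
scaler_mean = model.named_steps["preprocess"].named_transformers_["num"].mean_
print(proba, scaler_mean)
```

Because the scaler and encoder are fitted inside `fit`, the testing rows never influence the means, variances, or category lists used for preprocessing.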

Determine Evaluation Metric

We will attempt to optimize the Recall value:

Recall = TP / (TP + FN)

Referring to our main objective, which is to minimize costs, and considering that the largest cost comes from acquisition cost, we aim to minimize the number of customers who are likely to churn but do not receive any form of compensation to stay (retention cost).

Cross Validation Step

Plot CV Result

Next, we will perform cross-validation to assess how stable the model is when it learns from different subsets of the training data. From this stage, we obtain an average Recall of 0.792, or 79.2%.
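A cross-validation step of this kind can be sketched as below. The synthetic dataset, the five stratified folds, and the logistic regression model are all assumptions for the example, so the scores it prints will not match the 0.792 reported above.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic imbalanced data standing in for the training set.
X, y = make_classification(
    n_samples=1000, n_features=10, weights=[0.73, 0.27], random_state=0
)

pipe = make_pipeline(
    StandardScaler(),
    LogisticRegression(class_weight="balanced", max_iter=1000),
)

# Each fold refits the whole pipeline, so scaling never sees the held-out part.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(pipe, X, y, cv=cv, scoring="recall")
print(scores.mean(), scores.std())
```

A small spread between the fold scores is what "stable" means here; a large spread would suggest the model depends heavily on which rows it was trained on.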

Training The Model

After that, we will proceed with the training step and we obtain the Recall score, which is not too far from the cross-validation result, at 0.799 or 79.9% Recall.

Pipeline Result

Threshold Optimization

Next, we will proceed with threshold optimization, which involves finding the ideal balance between precision and recall in a classification problem. By adjusting the threshold, we can control the trade-off between correctly identifying positive cases (churn, in this case) and minimizing false positives.

After optimization, we achieved an optimal threshold of approximately 0.31, which significantly boosted our recall to 0.89 or 89%.
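The mechanics of a threshold search can be sketched as follows. The selection rule used here (highest F2 score, which weights recall more than precision) is an assumption, since the text does not state how the optimum was picked, and the labels and scores are made up.

```python
def precision_recall_at(y_true, scores, threshold):
    """Precision and recall when predicting churn for scores >= threshold."""
    tp = sum(1 for y, s in zip(y_true, scores) if s >= threshold and y == 1)
    fp = sum(1 for y, s in zip(y_true, scores) if s >= threshold and y == 0)
    fn = sum(1 for y, s in zip(y_true, scores) if s < threshold and y == 1)
    precision = tp / (tp + fp) if tp + fp else 1.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


def best_threshold(y_true, scores, beta=2.0):
    """Return the candidate threshold with the highest F-beta score."""
    best_t, best_f = None, -1.0
    for t in sorted(set(scores)):
        p, r = precision_recall_at(y_true, scores, t)
        denom = beta ** 2 * p + r
        f = (1 + beta ** 2) * p * r / denom if denom else 0.0
        if f > best_f:
            best_t, best_f = t, f
    return best_t


# Made-up labels (1 = churn) and predicted churn probabilities.
y_true = [0, 0, 0, 1, 0, 1, 0, 1, 1, 1]
scores = [0.05, 0.10, 0.20, 0.25, 0.35, 0.45, 0.55, 0.60, 0.80, 0.90]

chosen = best_threshold(y_true, scores)
recall_default = precision_recall_at(y_true, scores, 0.5)[1]
recall_chosen = precision_recall_at(y_true, scores, chosen)[1]
print(chosen, recall_default, recall_chosen)
```

Lowering the threshold below the default of 0.5 flags more customers as churn, which raises recall at the price of more false positives.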

Precision — Recall Curve
Optimal Threshold

Conclusion

After obtaining the best results, the challenge now is how to present it to stakeholders. Here’s how to interpret it:

Recall Interpretation

Interpretation

If we have 1000 customers with the potential to churn, then with 89% Recall we will correctly predict that 890 of them will churn. Thus, we can allocate a retention cost of $8,900 (assuming the initial retention cost is $10 per customer). However, there are 110 customers we did not predict correctly, so we need to spend an acquisition cost to replace those lost customers, which amounts to $7,700 (assuming the initial acquisition cost is $70 per customer). The total cost we incur is approximately $16,600.

Now, what if we didn’t use machine learning?

Interpretation

Without using machine learning, the most optimistic guess would have an accuracy of 50%. This means we would correctly guess 500 customers who will churn. Thus, we can allocate a retention cost of $5,000 (assuming the initial retention cost is $10 per customer). However, there are 500 customers we did not guess correctly, so we need to spend an acquisition cost to replace those lost customers, which amounts to $35,000 (assuming the initial acquisition cost is $70 per customer). The total cost we incur is approximately $40,000.
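The two scenarios above follow the same arithmetic, which can be written as a small helper. The function name is made up; the $10 and $70 figures are the assumptions stated earlier in the article.

```python
RETENTION_COST = 10    # assumed cost to retain one customer ($)
ACQUISITION_COST = 70  # assumed cost to acquire one customer ($)


def churn_cost(n_churners, recall):
    """Retention, acquisition, and total cost for a given recall."""
    caught = round(n_churners * recall)  # churners we identify in time
    missed = n_churners - caught         # churners we lose and must replace
    retention = caught * RETENTION_COST
    acquisition = missed * ACQUISITION_COST
    return retention, acquisition, retention + acquisition


with_model = churn_cost(1000, 0.89)
without_model = churn_cost(1000, 0.50)
print(with_model, without_model)
```

Note that this comparison counts only the churners; the retention money spent on false positives is handled in the Precision Effect section.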

Precision Effect

As a gentle reminder, we still have Precision to interpret. How are we going to do this? First, recall the Confusion Matrix we produced earlier.

Confusion Matrix

From here, we can infer that if we have 890 correct predictions of churn (True Positives) but only a 43% Precision, it means that we have approximately 1180 predictions that are false positives (FP). In total, we have 2070 churn predictions.

Out of the 2070 customers we predicted as churn, only 890 actually churned. The rest are customers who would have stayed but were predicted as churn, so we still allocate retention cost to them. In other words, out of the $20,700 we spent on retention cost, $8,900 hit the mark, and the rest did not.
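The numbers in this section can be checked with a short calculation that works backwards from Precision = TP / (TP + FP). The helper name is made up; the inputs are the figures quoted above.

```python
RETENTION_COST = 10  # assumed cost to retain one customer ($)


def predictions_from_precision(true_positives, precision):
    """Total churn predictions and false positives implied by a precision."""
    predicted = round(true_positives / precision)
    return predicted, predicted - true_positives


predicted, false_positives = predictions_from_precision(890, 0.43)
retention_spend = predicted * RETENTION_COST
on_target = 890 * RETENTION_COST
off_target = retention_spend - on_target
print(predicted, false_positives, retention_spend, on_target, off_target)
```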

Cost Effectiveness

Comparison

This means that if we total the costs and compare using the Machine Learning model ($20,700 in retention cost plus $7,700 in acquisition cost, or $28,400) with not using it ($40,000), we can save 29% of expenses. Of course, this doesn’t yet take into account the precision of our predictions if we don’t use machine learning. If we include that assessment, the savings ratio might be even higher.
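One way to arrive at the 29% figure from the numbers quoted in this article is sketched below. The breakdown is a reconstruction from those numbers, not taken from the original comparison chart.

```python
RETENTION_COST = 10    # assumed cost to retain one customer ($)
ACQUISITION_COST = 70  # assumed cost to acquire one customer ($)

# With the model: retention offers go to all 2070 predicted churners,
# and the 110 missed churners have to be replaced.
with_ml = 2070 * RETENTION_COST + 110 * ACQUISITION_COST

# Without the model: 500 churners guessed correctly, 500 missed.
without_ml = 500 * RETENTION_COST + 500 * ACQUISITION_COST

savings = 1 - with_ml / without_ml
print(with_ml, without_ml, round(savings, 2))
```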

Ridho Aryo Pratama

Data scientist from Indonesia. Teaching beginners about data science. Sharing knowledge, writing enthusiast, and avid gamer. Let's connect and learn together!