Telecommunication Company Customer Churn Prediction to Minimize the Company’s Costs
Introduction
In a telco company, there are two costs known as Acquisition Cost and Retention Cost. Acquisition Cost is the expense for a company to acquire new customers. Meanwhile, Retention Cost is the spending for the company to retain existing customers.
Due to human limitations, we often wrongly predict which customers will churn and which will stay. As a result, funds can be misallocated and total spending becomes larger than necessary.
Furthermore, according to various sources, the acquisition cost is seven times higher than the retention cost. If we wrongly predict that a customer will continue using our product and they churn instead, we end up spending more than necessary to replace them.
Objectives
In this project, our goal is to create a smart computer program using Machine Learning that can guess which customers might decide to stop using our services and which ones will continue using them. The idea is to make this guessing game as accurate as possible. Why? So we can be really smart about where we spend our money on trying to keep customers and where we don’t need to spend as much.
Objective 1
Identify the factors influencing customer churn.
Objective 2
Create a machine learning model capable of predicting churn.
Objective 3
Minimize costs.
GitHub Repo
Visit this link to get more details about the code and presentation: https://github.com/ridhoaryo/TelcoCustomerChurn
Assumption
We will attempt to make initial assumptions about retention cost and acquisition cost. For this case, we will consider the retention cost to be $10, while the acquisition cost is $70 (7 times larger).
Retention Cost
Retention cost, also known as Customer Retention Cost, refers to the expenses a company incurs to retain its existing customers and prevent them from leaving or churning.
Acquisition Cost
Acquisition cost, often referred to as Customer Acquisition Cost (CAC), is the expense incurred by a company to acquire new customers. It represents the cost associated with convincing a potential customer to make their first purchase or start using a company’s products or services.
The Dataset
The dataset we are using contains 7043 observations and 33 variables. It describes a fictional telco company that provided home phone and Internet services to 7043 customers in California in Q3. Visit this link to go to the dataset.
Data Understanding and Preparation
From the dataframe above, several columns have only one unique value, namely [`Count`, `Country`, `State`], so they carry no information for the model. Additionally, I will not use the `CustomerID` column because a customer's ID does not determine the probability of churning.
`Zip code`, `Lat Long`, `Latitude`, and `Longitude` will also be deleted. I will not use them to build the Machine Learning model.
I will also remove `Churn Score` and `Churn Reason`, since in practice it is impossible to obtain a customer’s Churn Score or Churn Reason before they actually churn. Keeping them would also leak information about the target when we build the model.
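The column clean-up described above can be sketched with pandas. This is a minimal illustration on a tiny made-up frame, not the project's actual code: the helper name `drop_unused_columns`, the sample values, and the target column name `Churn Value` are assumptions, while the dropped column names follow the ones mentioned in the text.

```python
import pandas as pd

# Columns named in the text as identifiers, location details, or leakage sources.
MANUAL_DROPS = [
    "CustomerID", "Zip code", "Lat Long", "Latitude", "Longitude",
    "Churn Score", "Churn Reason",
]

def drop_unused_columns(df: pd.DataFrame) -> pd.DataFrame:
    """Drop single-valued columns plus the manually listed ones (hypothetical helper)."""
    # A column with only one unique value cannot help separate churners from non-churners.
    constant_cols = [col for col in df.columns if df[col].nunique(dropna=False) <= 1]
    to_drop = set(constant_cols) | (set(MANUAL_DROPS) & set(df.columns))
    return df.drop(columns=sorted(to_drop))

# Tiny made-up sample, only to show the behaviour.
sample = pd.DataFrame({
    "CustomerID": ["a1", "b2", "c3"],
    "Count": [1, 1, 1],
    "Country": ["United States"] * 3,
    "State": ["California"] * 3,
    "Tenure Months": [2, 40, 13],
    "Churn Score": [86, 20, 55],
    "Churn Value": [1, 0, 0],  # assumed name of the target column
})
cleaned = drop_unused_columns(sample)
print(list(cleaned.columns))
```

Detecting constant columns programmatically, instead of hard-coding them, keeps the step valid if the data is refreshed and another column turns out to be single-valued.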
Data Analysis
Churn Proportion
From the pie chart above, we can see that 26.54% of the customers in this dataset are labelled as churn customers. This indicates an imbalance between the number of customers leaving and those staying. While some might think that this dataset needs to be resampled so that the labels are “balanced,” I am hesitant to take that step at this point. Instead, I will conduct further exploration to gain more insights.
Churn by Categorical Features
Before we delve further into the realm of machine learning modelling, we will try to examine the impact of each customer’s demographic and behavioral feature on their churn status.
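The comparisons in the subsections that follow all boil down to the churn rate within each category of a feature. A minimal pandas sketch of that calculation is shown here; the helper name `churn_rate_by`, the target column name `Churn Value`, and the toy rows are assumptions made for illustration and do not reproduce the real figures.

```python
import pandas as pd

def churn_rate_by(df: pd.DataFrame, feature: str, target: str = "Churn Value") -> pd.Series:
    """Share of churned customers within each category of `feature` (hypothetical helper)."""
    # With a 0/1 target, the group mean is the churn rate of that group.
    return df.groupby(feature)[target].mean().sort_values(ascending=False)

# Made-up rows, only to demonstrate the calculation.
toy = pd.DataFrame({
    "Contract": ["Month-to-month", "Month-to-month", "Two year",
                 "Two year", "One year", "Month-to-month"],
    "Churn Value": [1, 1, 0, 0, 0, 0],
})
rates = churn_rate_by(toy, "Contract")
print(rates)
```

Dividing one group's rate by another's gives the "N times more likely" ratios quoted for each feature.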
Churn by Gender
It can be observed that the probability of churn based on gender does not differ significantly between men and women.
Churn by Senior Citizen Status
The elderly have nearly twice the chance of churning compared to the younger generation.
Churn by Partner
Customers without a partner also have a tendency to churn almost twice as much as those who have a partner.
Churn by Dependents
Customers who have children (or other dependents) have a 5 times greater chance of churning compared to customers who do not have them.
Churn by Internet Service Subscription
Customers who use a fiber optic connection for their internet service subscription have nearly 6 times greater chances of churning compared to those who do not subscribe to internet services.
Churn by Device Protection
Customers who do not subscribe to device protection have a tendency to churn more than 5 times that of those who use device protection.
Churn by Contract
Customers with month-to-month contracts are 15 times more likely to churn than customers with 2-year or annual contracts.
Churn by Payment Method
Customers who use electronic checks as their payment method are approximately 3 times more likely to churn compared to other payment methods.
Churn by Numerical Features
Now, we shift our focus to numerical features. We will try to examine the relationship between these numerical features and churn status.
Tenure Months Distribution by Churn
Here we observe the distribution of tenure months by churn status. We also use the Wilcoxon test to assess the relationship between the feature and the target. According to this assessment, the ‘tenure months’ column does not have a strong enough relationship with churn.
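For readers who want to try this kind of check, here is a minimal sketch using SciPy. It assumes the rank-sum variant of the Wilcoxon test (`scipy.stats.ranksums`), which compares two independent groups; the text does not say which variant was actually used. The tenure values are made up and say nothing about the real dataset.

```python
from scipy.stats import ranksums

# Made-up tenure values (in months) for two independent groups of customers.
churned_tenure = [1, 2, 3, 5, 8, 4, 2, 6]
retained_tenure = [24, 36, 48, 60, 72, 30, 45, 55]

# Two-sided test of whether the two samples come from the same distribution.
statistic, p_value = ranksums(churned_tenure, retained_tenure)
print(f"statistic={statistic:.2f}, p-value={p_value:.4f}")
```

A small p-value suggests that the two distributions differ, but it does not measure how strong the relationship is, so it is worth reading alongside the distribution plots.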
Total Charges Distribution by Churn
Turns out, the ‘total charges’ column also does not have a strong enough relationship with churn.
CLTV Distribution by Churn
Turns out, the ‘CLTV’ column also does not have a strong enough relationship with churn.
Checking Multicollinearity
In addition to examining the relationship between Exogenous Variables and the target, we will also look at the potential multicollinearity among numerical variables. We will use correlation and VIF.
VIF, or Variance Inflation Factor, is a statistic used to measure how much the variance of an estimated regression coefficient increases when your predictors (independent variables) are correlated. In simpler terms, it helps us understand if there is a problem of multicollinearity in our data, which occurs when two or more independent variables in a regression model are highly correlated with each other. High VIF values (usually above 10) indicate a high degree of multicollinearity, which can make it difficult to interpret the effects of individual predictors in a regression model. It’s important to keep VIF values low to ensure the reliability of your regression analysis.
Surprisingly, all features have high VIF values. The columns ‘Tenure Months,’ ‘Monthly Charges,’ and ‘Total Charges’ are interconnected. We can choose one of them. Here, we will use the ‘Total Charges’ column.
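To make the definition concrete, the sketch below computes VIF directly from its formula, VIF_j = 1 / (1 - R_j²), where R_j² comes from regressing feature j on the remaining features. It uses plain NumPy rather than whichever library the project used, and the synthetic data (total charges built as tenure times monthly charges plus noise) is an assumption meant only to mimic the kind of dependency described above.

```python
import numpy as np

def vif(X):
    """Variance Inflation Factor of every column of X (illustrative implementation)."""
    X = np.asarray(X, dtype=float)
    n_rows, n_cols = X.shape
    factors = []
    for j in range(n_cols):
        target = X[:, j]
        others = np.delete(X, j, axis=1)
        # Regress column j on the other columns, with an intercept term.
        design = np.column_stack([np.ones(n_rows), others])
        coef, *_ = np.linalg.lstsq(design, target, rcond=None)
        residuals = target - design @ coef
        r_squared = 1.0 - residuals.dot(residuals) / ((target - target.mean()) ** 2).sum()
        factors.append(1.0 / (1.0 - r_squared))
    return factors

# Synthetic data: total charges depend on both tenure and monthly charges.
rng = np.random.default_rng(0)
tenure = rng.integers(1, 73, size=200).astype(float)
monthly = rng.uniform(20, 120, size=200)
total = tenure * monthly + rng.normal(0, 50, size=200)

vif_values = vif(np.column_stack([tenure, monthly, total]))
print([round(v, 2) for v in vif_values])
```

Uncorrelated features give a VIF of 1, and the value grows as a feature becomes easier to predict from the others.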
Modelling Strategies
For the modelling stage, the focus is on catching as many of the customers who will churn as possible. If the model predicts ‘retain’ for a customer who actually churns, we lose that customer. According to some opinions, the cost of customer acquisition is 7 times greater than the cost of retaining the customer.
For that reason, the modelling will focus on the Recall score. We will use the PR (Precision-Recall) curve to find the optimum threshold.
The strategy is, first, to divide the data into two parts: Training Data (80%) and Testing Data (20%). From the Training Data, we will further split it into two parts: Training Data and Validation Data. The Validation Data will be used to assess the model’s predictive abilities.
After this splitting step, the training set has a shape of 5634 x 19 and the testing set has a shape of 1409 x 19.
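The 80/20 split and the resulting shapes can be reproduced with scikit-learn's `train_test_split`. The sketch below uses placeholder arrays with the same dimensions as the prepared data; stratifying on the label and the `random_state` value are assumptions, since the text does not state them.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder data with the same dimensions as the prepared dataset.
n_rows, n_features = 7043, 19
X = np.zeros((n_rows, n_features))
# Roughly 26.5% positive labels, mirroring the churn proportion.
n_churn = 1869
y = np.array([1] * n_churn + [0] * (n_rows - n_churn))

# Stratifying keeps the churn proportion similar in both parts (an assumption here).
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)
print(X_train.shape, X_test.shape)
```

The same function can be called again on the training part to carve out a validation set.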
Initiating Machine Learning Pipeline
The second step involves creating a Pipeline that allows us to process the data up to the model fitting stage. Why use a Pipeline? This is to avoid data leakage.
Data leakage occurs when information from the testing dataset “leaks” into the training process. In other words, the model unintentionally gains access to information it shouldn’t have during training, which can lead to unrealistically optimistic performance estimates.
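A minimal sketch of such a pipeline with scikit-learn is shown below. The choice of `LogisticRegression`, the specific preprocessing steps, the column selection, and the toy rows are all assumptions for illustration; the project's repository holds the actual configuration.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

numeric_cols = ["Total Charges"]
categorical_cols = ["Contract", "Payment Method"]

# Preprocessing is fitted inside the pipeline, so it only ever sees training data.
preprocess = ColumnTransformer([
    ("num", StandardScaler(), numeric_cols),
    ("cat", OneHotEncoder(handle_unknown="ignore"), categorical_cols),
])

pipeline = Pipeline([
    ("preprocess", preprocess),
    # Assumed model; class_weight="balanced" is one way to handle the label imbalance.
    ("model", LogisticRegression(class_weight="balanced", max_iter=1000)),
])

# Made-up rows, only to show that the pipeline fits and predicts end to end.
toy = pd.DataFrame({
    "Total Charges": [50.0, 120.5, 3000.0, 4500.0, 80.0, 2500.0, 60.0, 5200.0],
    "Contract": ["Month-to-month", "Month-to-month", "Two year", "One year",
                 "Month-to-month", "Two year", "Month-to-month", "One year"],
    "Payment Method": ["Electronic check", "Electronic check", "Credit card",
                       "Bank transfer", "Mailed check", "Credit card",
                       "Electronic check", "Bank transfer"],
    "Churn Value": [1, 1, 0, 0, 1, 0, 1, 0],
})
features = toy[numeric_cols + categorical_cols]
pipeline.fit(features, toy["Churn Value"])
preds = pipeline.predict(features)
print(preds)
```

Because scaling and encoding are learned during `fit`, running the pipeline inside cross-validation refits them on each training fold, which is what prevents the leakage described above.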
Determine Evaluation Metric
We will attempt to optimize the Recall value, defined as Recall = TP / (TP + FN). Referring to our main objective, which is to minimize costs, and considering that the largest cost comes from acquisition, we aim to minimize the number of customers who are likely to churn but do not receive any form of compensation to stay (retention cost).
Cross Validation Step
Next, we will perform cross-validation to assess how stable the model is when learning from gradually changing training data. From this stage, we can obtain an average score of 0.792 or 79.2% for Recall.
After that, we will proceed with the training step and we obtain the Recall score, which is not too far from the cross-validation result, at 0.799 or 79.9% Recall.
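Cross-validated recall can be obtained with scikit-learn's `cross_val_score` and `scoring="recall"`. The sketch below runs on synthetic data with a similar class imbalance; the model, the number of folds, and the random seeds are assumptions, so the scores will not match the 0.792 reported above.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic stand-in for the training data, with about 27% positive labels.
X, y = make_classification(
    n_samples=1000, n_features=10, weights=[0.73, 0.27], random_state=0
)

model = LogisticRegression(class_weight="balanced", max_iter=1000)  # assumed model
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)      # assumed 5 folds

# One recall score per fold; the spread shows how stable the model is.
scores = cross_val_score(model, X, y, cv=cv, scoring="recall")
print(f"mean recall={scores.mean():.3f}, std={scores.std():.3f}")
```

A small standard deviation across folds, together with a training score close to the cross-validation mean, is the kind of stability the text refers to.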
Threshold Optimization
Next, we will proceed with threshold optimization, which involves finding the ideal balance between precision and recall in a classification problem. By adjusting the threshold, we can control the trade-off between correctly identifying positive cases (churn, in this case) and minimizing false positives.
After optimization, we achieved an optimal threshold of approximately 0.31, which significantly boosted our recall to 0.89 or 89%.
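The text does not state which rule was used to pick the threshold from the PR curve. One possible rule, sketched below, is to take the highest threshold whose recall still meets a target. The helper name `highest_threshold_for_recall`, the target recall, and the toy scores are assumptions.

```python
import numpy as np
from sklearn.metrics import precision_recall_curve

def highest_threshold_for_recall(y_true, y_score, min_recall):
    """Pick the largest threshold that keeps recall >= min_recall (hypothetical rule)."""
    precision, recall, thresholds = precision_recall_curve(y_true, y_score)
    # precision/recall have one extra trailing point with no matching threshold.
    meets_target = recall[:-1] >= min_recall
    # Recall does not increase as the threshold grows, so take the last qualifying index.
    idx = np.where(meets_target)[0][-1]
    return thresholds[idx], precision[idx], recall[idx]

# Made-up labels and predicted churn probabilities.
y_true = [0, 0, 1, 0, 1, 0, 1, 1, 1, 1]
y_score = [0.1, 0.2, 0.3, 0.35, 0.4, 0.6, 0.65, 0.7, 0.8, 0.9]

threshold, prec, rec = highest_threshold_for_recall(y_true, y_score, min_recall=0.8)
print(f"threshold={threshold:.2f}, precision={prec:.2f}, recall={rec:.2f}")
```

Once a threshold is chosen, a customer is flagged as churn whenever the predicted probability is at or above it. Lowering the threshold below the default 0.5 raises recall at the expense of precision.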
Conclusion
After obtaining the best results, the challenge now is how to present it to stakeholders. Here’s how to interpret it:
Recall Interpretation
If we have 1000 customers with the potential to churn, then we will accurately predict that 890 customers will churn. Thus, we can allocate a retaining cost of $8,900 (assuming the initial retention cost is $10 per customer). However, there are 110 customers we did not predict accurately, so we need to spend an acquisition cost to replace those lost customers, which amounts to $7,700 (assuming the initial acquisition cost is $70 per customer). The total cost we incur is approximately $16,600.
Now, what if we didn’t use machine learning?
Without using machine learning, the most optimistic guess would have an accuracy of 50%. This means we would accurately guess 500 customers who will churn. Thus, we can allocate a retaining cost of $5,000 (assuming the initial retention cost is $10 per customer). However, there are 500 customers we did not guess correctly, so we need to spend an acquisition cost to replace those lost customers, which amounts to $35,000 (assuming the initial acquisition cost is $70 per customer). The total cost we incur is approximately $40,000.
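The two cost figures above follow from the same simple calculation, sketched here in plain Python. The function name `cost_given_recall` is made up for illustration; the $10 and $70 unit costs are the assumptions stated earlier in the article.

```python
RETENTION_COST = 10    # assumed cost to retain one customer, in dollars
ACQUISITION_COST = 70  # assumed cost to acquire one replacement customer, in dollars

def cost_given_recall(n_churners: int, recall: float) -> int:
    """Cost of retaining the churners we catch and replacing the ones we miss."""
    caught = round(n_churners * recall)
    missed = n_churners - caught
    return caught * RETENTION_COST + missed * ACQUISITION_COST

with_model = cost_given_recall(1000, 0.89)     # 890 caught, 110 missed
without_model = cost_given_recall(1000, 0.50)  # 500 caught, 500 missed
print(with_model, without_model)
```

This view only counts customers who really churn. Retention money spent on customers who would have stayed anyway depends on precision, which is a separate calculation.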
Precision Effect
As a gentle reminder, we still have Precision to interpret. How are we going to do this? First things first, I would like to remind you about the Confusion Matrix we produced earlier.
From here, we can infer that if we have 890 correct predictions of churn (True Positives) but only a 43% Precision, it means that we have approximately 1180 predictions that are false positives (FP). In total, we have 2070 churn predictions.
Out of the 2070 customers we predicted as churn, we only accurately predicted 890 who actually churned. The rest are customers who were retained but predicted as churn. So, we still need to allocate retention cost for those retaining customers. In other words, out of the $20,700 we spent on retention cost, $8,900 hit the mark, and the rest did not.
Cost Effectiveness
This means that if we compare the total costs with and without the Machine Learning model, we can save up to 29%: with the model we spend $20,700 on retention plus $7,700 on acquisition, or $28,400 in total, against $40,000 without it. Of course, this doesn’t yet take into account the precision of our predictions if we don’t use machine learning. If we include that assessment, the savings ratio might be even higher.
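The 29% figure can be reproduced from the numbers in this section. The sketch below is plain arithmetic under the article's assumptions ($10 retention cost, $70 acquisition cost, 89% recall, 43% precision, 1000 customers who would churn); the variable names are made up for illustration.

```python
RETENTION_COST = 10    # assumed, in dollars per customer
ACQUISITION_COST = 70  # assumed, in dollars per customer

true_positives = 890   # churners the model catches (89% recall on 1000 churners)
missed_churners = 110  # churners the model misses
precision = 0.43

# Precision = TP / (TP + FP), so the number of churn predictions is TP / precision.
predicted_churn = round(true_positives / precision)
false_positives = predicted_churn - true_positives

# Every predicted churner receives the retention offer, including false positives.
retention_spend = predicted_churn * RETENTION_COST
acquisition_spend = missed_churners * ACQUISITION_COST
total_with_model = retention_spend + acquisition_spend

# Baseline from the 50% guess: 500 retained at $10 and 500 replaced at $70.
total_without_model = 500 * RETENTION_COST + 500 * ACQUISITION_COST

savings = 1 - total_with_model / total_without_model
print(predicted_churn, false_positives, total_with_model, f"{savings:.0%}")
```

Because both unit costs are assumptions, the savings ratio changes if the real retention and acquisition costs differ from $10 and $70.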