Customer Churn Analysis
Problem Statement:
Domain Topic Telecom Churn Analysis Telecom Churn (loss of customers to competition) is a problem for telecom companies because it is expensive to acquire a new customer and companies want to retain their existing customers. Most telecom companies suffer from voluntary churn.
Customer churn is when a company’s customers stop doing business with that company. Businesses are very keen on measuring churn because keeping an existing customer is far less expensive than acquiring a new customer. New business involves working leads through a sales funnel, using marketing and sales budgets to gain additional customers. Existing customers will often have a higher volume of service consumption and can generate additional customer referrals.
Customer retention can be achieved with good customer service and products. But the most effective way for a company to prevent attrition of customers is to truly know them. The vast volumes of data collected about customers can be used to build churn prediction models. Knowing who is most likely to defect means that a company can priorities focused marketing efforts on that subset of their customer base.
Preventing customer churn is critically important to the telecommunications sector, as the barriers to entry for switching services are so low.
I will examine customer data from IBM Sample Data Sets with the aim of building and comparing several customer churn prediction models.
DATA ANALYSIS/EDA
Data comprises of Services — which services the customer subscribed to (internet, phone, cable, etc.), Monthly bill total, Tenure — How long they had been a customer, Basic demographic info — whether they were elderly, had dependents, etc.
let us do the data analysis one by one
Gender
As the gender indicate , that how much mostly , in telecom industry the customer has the insured sex count and from my analysis
Male 3555
Female 3488countmale = len(df[df.gender == 'Male'])
countfemale = len(df[df.gender == 'Female'])
print("Percentage of male: {:.2f}%".format((countmale / (len(df.gender))*100)))
print("Percentage of female: {:.2f}%".format((countfemale / (len(df.gender))*100)))Percentage of male: 50.48%
Percentage of female: 49.52%
it clearly shows that telecom industry does not depend upon gender and has almost equal ratio between male and female, only uniqueness in dataset
SENIOR-CITIZEN
it can be a possible exploratory analysis that senior citizen , if they are in count, can possible can leave doing business with the company. Businesses are very keen on measuring churn because keeping an existing customer is far less expensive than acquiring a new customer but as a senior citizen cannot that be possibly true for keeping customer for a long time
0 5901
1 1142count0 = len(df[df.SeniorCitizen == 0])count1 = len(df[df.SeniorCitizen == 1])print("Percentage of non senior citizen: {:.2f}%".format((count0 / (len(df.SeniorCitizen))*100)))print("Percentage of senior citizen: {:.2f}%".format((count1 / (len(df.SeniorCitizen))*100)))Percentage of non senior citizen: 83.79%
Percentage of senior citizen: 16.21%
maximum people are not senior citizen and only 16.2% people from data are senior citizen but still 16% have possibility of leaving company or not using much of services.
PARTNER
Partner Interconnect provides connectivity between your on-premises network and your Virtual Private Cloud (VPC) network through a supported service provider.
No 3641
Yes 3402countno = len(df[df.Partner == 'No'])countyes = len(df[df.Partner == 'Yes'])print("Percentage of no: {:.2f}%".format((countno / (len(df.Partner))*100)))Print("Percentage of yes: {:.2f}%".format((countyes / (len(df.Partner))*100)))Percentage of no: 51.70%
Percentage of yes: 48.30%
There is are balanced ratio between the partnered and indivudal connections customers. It means churn is independent of connection type
DEPENDENTS
It surely means whether the customer is likely to be dependent or not, Customer without dependents are four times more likely to churn. Senior citizens are three times less likely to churn
No 4933
Yes 2110
countno = len(df[df.Dependents == 'No'])countyes = len(df[df.Dependents == 'Yes'])print("Percentage of no: {:.2f}%".format((countno / (len(df.Dependents))*100)))print("Percentage of yes: {:.2f}%".format((countyes / (len(df.Dependents))*100)))Percentage of no: 70.04%
Percentage of yes: 29.96%
Most of the customers are independent. and almost 30 percent are dependent upon others, as possible explanation can be , that maximum people earn by themselves and pay bills, can only count students and senior citizen in this 30 percent
TENURE
Tenure refers to the number of months that a customer has subscribed for. The tenure for a churning customer indicates the number of months that they spent as a subscriber.
as exploratory description of tenure is
count 7043.000000
mean 32.371149
std 24.559481
min 0.000000
25% 9.000000
50% 29.000000
75% 55.000000
max 72.000000
so on an average, tenure of a customer can be taken as 32 which is good , so even after this much tenure , if a customer is leaving, may be there can be lot of other services that encouraging customer to leave.
TENURE/CHURN (BI-vairiate analysis)
From he analysis, the customers with less tenures mostly stop the services.
The regular customers are distributed uniformly, have normal distribution
PHONE SERVICE
A phone service is a public utility service focused on voice communications delivered by a phone company to residential and commercial clients.
Yes 6361
No 682countno = len(df[df.PhoneService == 'No'])countyes = len(df[df.PhoneService == 'Yes'])print("Percentage of no: {:.2f}%".format((countno / (len(df.PhoneService))*100)))print("Percentage of yes: {:.2f}%".format((countyes / (len(df.PhoneService))*100)))Percentage of no: 9.68%
Percentage of yes: 90.32%
maximum people have almost 90 % have phone service and 10 percent doesn’t have this service. Out of 7000, 700 don’t have phone services, might be using other services only for once or twice.
MULTIPLE LINES
A Multiline is a group of analogue telephone lines with one telephone number. The first line is known as the main exchange line, whilst the extra lines are known as secondary lines. The additional lines are added to the main exchange telephone line so there’s no need for a new number.
No 3390
Yes 2971
No phone service 682countno = len(df[df.MultipleLines == 'No'])countyes = len(df[df.MultipleLines == 'Yes'])
print("Percentage of no: {:.2f}%".format((countno / (len(df.MultipleLines))*100)))print("Percentage of yes: {:.2f}%".format((countyes / (len(df.MultipleLines))*100)))countnoservice = len(df[df.MultipleLines == 'No phone service'])print("Percentage of noservice: {:.2f}%".format((countnoservice / (len(df.MultipleLines))*100)))Percentage of no: 48.13%
Percentage of yes: 42.18%
Percentage of noservice: 9.68%
other than the people with no phone service at all, Customers equally use the multiline and single line phone services.
INTERNET SERVICES
Telecom service providers offer voice calling and text messaging directly, while internet services run on the top layer of the network as standalone services. The services are offered in a different context, even if they sometimes serve the same function for end
Fiber optic 3096
DSL 2421
No 1526countfibre = len(df[df.InternetService == 'Fibre optic'])countDSL = len(df[df.InternetService == 'DSL'])print("Percentage of fibre: {:.2f}%".format((countno / (len(df.InternetService))*100)))print("Percentage of DSl: {:.2f}%".format((countyes / (len(df.InternetService))*100)))countno = len(df[df.InternetService == 'No'])print("Percentage of no: {:.2f}%".format((countno / (len(df.InternetService))*100)))Percentage of fibre: 48.13%
Percentage of DSl: 42.18%
Percentage of no: 21.67%
Most of the customers have opted for Fibre Optic as it is the fasted and highest bandwidth connumication meduim.
Fibre optic being able to have a very high speed, the cost for the fibre optics is high, thus many of the custmers also prefer DSL i.e., Digital Subscriber Line.
Yet 21 percent of the customers prefer not have any internet connection.
ONLINE SECURITY
Online security refers to the body of technologies, processes, and practices designed to protect networks, devices, programs, and data from attack, damage, or unauthorized access. Online security may also be referred to as information technology security.
No 3498
Yes 2019
No internet service 1526countno = len(df[df.OnlineSecurity == 'No'])countyes = len(df[df.OnlineSecurity == 'Yes'])print("Percentage of no: {:.2f}%".format((countno / (len(df.OnlineSecurity))*100)))print("Percentage of yes: {:.2f}%".format((countyes / (len(df.OnlineSecurity))*100)))countnoservice = len(df[df.OnlineSecurity == 'No internet service'])print("Percentage of noservice: {:.2f}%".format((countnoservice / (len(df.OnlineSecurity))*100)))Percentage of no: 49.67%
Percentage of yes: 28.67%
Percentage of no service: 21.67%
only 21% people have online security, others dont have.
ONLINE BACKUP
A remote backup, online backup or cloud backup is where a copy of computer files or systems are stored in a secure offsite location. In the event of an IT failure where the original data is lost, the remotely stored data can be used to return operations to normal running.
countno = len(df[df.OnlineBackup == 'No'])countyes = len(df[df.OnlineBackup == 'Yes'])print("Percentage of no: {:.2f}%".format((countno / (len(df.OnlineBackup))*100)))print("Percentage of yes: {:.2f}%".format((countyes / (len(df.OnlineBackup))*100)))countnoservice = len(df[df.OnlineBackup == 'No internet service'])print("Percentage of noservice: {:.2f}%".format((countnoservice / (len(df.OnlineBackup))*100)))Percentage of no: 43.84%
Percentage of yes: 34.49%
Percentage of noservice: 21.67%
only 34% have online backup options with them, rest don’t have it.
‘OnlineSecurity’, ‘OnlineBackup’, ‘DeviceProtection’, ‘TechSupport’,’StreamingTV’, ‘StreamingMovies’
There above are the services on can avail only if they had availed the internet service.
the above services have almost same percentage as above two shown, shows that if the customer has internet services available, the only he can avail other services like ‘OnlineSecurity’, ‘OnlineBackup’, ‘DeviceProtection’, ‘TechSupport’,’StreamingTV’, ‘StreamingMovies’
ALL SERVICES ABOVE/MONTHLY CHARGES
From the analysis, its shown that the more services customer has, the more will be monthly be added, that is The monthly charges increase as the number of sub products increase
CONTRACT
If you are a telecom business or if you are contracting for telecommunication services, your contract is likely to be highly technical and possibly very detailed.
Month-to-month 3875
Two year 1695
One year 1473
As you can see from above count, Customer mostly prefer the month to month contract, possible options as for customer it is difficult to pay all amount all of a sudden for year or two.
PAYMENT METHODS
A customer can make payment using different payment methods that are supported by the service provider; for example, the customer can make payments using the payment methods such as cheque, credit card, debit card or wire transfers, or direct cash deposit
from analysis, electronic cheque is preferred by customers as they feel safe to pay for the services
PAYMENT METHOD VS TENURE
AS people with high tenure with company , customers do prefer automatic payments more.
MONTHLY CHARGES AND TOTAL CHARGES
as per analysis goes by, monthly charges for a customer from box plot , shows between 40–80/- and total charges aand maximum between 0–2000/- and then goes down with count.
INTERNET SERVICES /MONTHLY CHARGES
AS most people prefer month to month charges but monthly wise Fibre optics are costly than DSL
MULTIPLE LINES/MONTHLY CHARGES
AS most people prefer month to month charges but multiples lines make monthly charge costly
TOTAL CHARGES/CHURN
from the analysis, total charges/monthly charges are almost vice versa and uniformly distributed to other features
CHURN
Churn rate is the percentage of subscribers to a service that discontinue their subscription to that service in a given time period. … Churn rate is an important consideration in the telephone and cell phone services industry.
No 5163
Yes 1869countno = len(df[df.Churn == 'No'])countyes = len(df[df.Churn == 'Yes'])print("Percentage of no: {:.2f}%".format((countno / (len(df.Churn))*100)))print("Percentage of yes: {:.2f}%".format((countyes / (len(df.Churn))*100)))Percentage of no: 73.42%
Percentage of yes: 26.58%
so company is loosing almost 26% customers.
PRE PROCESSING
FROM CORRELATION
I can conclude that To avoid unstable estimates of coeffiecients in the model, need to drop the ‘Total Charges’ variable during regression process, as it is highly correlated to both ‘Tenure’ and ‘Monthly Charges’.
It can be seen from correlation plot that, Contract_Month-to-month, Online security_No, Tech Support_No…etc. are positively correlated with Churn. While, on the other end of the plot, tenure, Contract_Two year, InternetService_No…etc. are negatively correlated with Churn.
Interestingly, services such as Online security, Streaming TV, OnlineBackup, TechSupport…, etc. with InternetService_No seem to be negatively related to Churn. Need to explore the patterns more for the above correlations below before modelling and identifying the important variables.
NULL VALUES AND REDUNDANT COLUMNS
so over all we have almost 7043records and 21 columns
no null values and got rid of redundant columns as customer ID.
OUTLIERS
From box plot, there were no outliers as well.
SKEW STATS
As data was positively skewed, so square root tranformation worked in a positive way for balancing.
LABEL ENCODING
as maximum data was categorical so from Label encoder, dataset been encoded and for object data types as data was mostly categorical, dummies were created.
from sklearn.preprocessing import LabelEncoder
le=LabelEncoder()
df['Churn']= le.fit_transform(df['Churn'])df=pd.get_dummies(df,columns=df.columns[df.dtypes=="object"],drop_first=True)
SCALING
As dataset need to get scaled to get maximum accuracy, standard scaler was taken help to get out data set properly scaled before building model
MODEL BUILDING
from train test split , we got the shapes of dataset as
((4922, 30), (4922,), (2110, 30), (2110,))
In [476]:
ALGORITHMS
KNN=KNeighborsClassifier()
SV=SVC()
LR=LogisticRegression()
DT=DecisionTreeClassifier(random_state=60)
GNB=GaussianNB()
RFC=RandomForestClassifier(n_estimators=200,random_state=60)
GBC=GradientBoostingClassifier(n_estimators=200,random_state=60)
ABC=AdaBoostClassifier(random_state=60)
ETC=ExtraTreesClassifier(random_state=60)
FINAL CONCLUSION
KNeighborsClassifier()
Accuracy_score = 0.7644549763033175
Cross_Val_Score = 0.7677746750937541
roc_auc_score = 0.6912014996737588
classification_report
precision recall f1-score support
0 0.83 0.85 0.84 1549
1 0.56 0.53 0.55 561
accuracy 0.76 2110
macro avg 0.70 0.69 0.69 2110
weighted avg 0.76 0.76 0.76 2110
[[1313 236]
[ 261 300]]
AxesSubplot(0.125,0.808774;0.62x0.0712264)
---------------------> SVC <----------------------
SVC()
Accuracy_score = 0.7824644549763033
Cross_Val_Score = 0.8000567777705936
roc_auc_score = 0.6517360979252902
classification_report
precision recall f1-score support
0 0.80 0.93 0.86 1549
1 0.66 0.37 0.48 561
accuracy 0.78 2110
macro avg 0.73 0.65 0.67 2110
weighted avg 0.77 0.78 0.76 2110
[[1442 107]
[ 352 209]]
AxesSubplot(0.125,0.808774;0.62x0.0712264)
---------------------> LogisticRegression <----------------------
LogisticRegression()
Accuracy_score = 0.7971563981042654
Cross_Val_Score = 0.8081628653174706
roc_auc_score = 0.7089261198933473
classification_report
precision recall f1-score support
0 0.84 0.90 0.87 1549
1 0.65 0.52 0.58 561
accuracy 0.80 2110
macro avg 0.74 0.71 0.72 2110
weighted avg 0.79 0.80 0.79 2110
[[1390 159]
[ 269 292]]
AxesSubplot(0.125,0.808774;0.62x0.0712264)
---------------------> DecisionTreeClassifier <----------------------
DecisionTreeClassifier(random_state=45)
Accuracy_score = 0.7393364928909952
Cross_Val_Score = 0.7272458538083539
roc_auc_score = 0.6712513046770442
classification_report
precision recall f1-score support
0 0.83 0.82 0.82 1549
1 0.51 0.53 0.52 561
accuracy 0.74 2110
macro avg 0.67 0.67 0.67 2110
weighted avg 0.74 0.74 0.74 2110
[[1265 284]
[ 266 295]]
AxesSubplot(0.125,0.808774;0.62x0.0712264)
---------------------> GaussianNB <----------------------
GaussianNB()
Accuracy_score = 0.6526066350710901
Cross_Val_Score = 0.6626806381740591
roc_auc_score = 0.7218969400072959
classification_report
precision recall f1-score support
0 0.92 0.57 0.71 1549
1 0.43 0.87 0.57 561
accuracy 0.65 2110
macro avg 0.67 0.72 0.64 2110
weighted avg 0.79 0.65 0.67 2110
[[889 660]
[ 73 488]]
AxesSubplot(0.125,0.808774;0.62x0.0712264)
---------------------> RandomForestClassifier <----------------------
RandomForestClassifier(random_state=45)
Accuracy_score = 0.7914691943127962
Cross_Val_Score = 0.7916647808095176
roc_auc_score = 0.6982309327275719
classification_report
precision recall f1-score support
0 0.83 0.90 0.86 1549
1 0.64 0.50 0.56 561
accuracy 0.79 2110
macro avg 0.73 0.70 0.71 2110
weighted avg 0.78 0.79 0.78 2110
[[1390 159]
[ 281 280]]
AxesSubplot(0.125,0.808774;0.62x0.0712264)
---------------------> GradientBoostingClassifier <----------------------
GradientBoostingClassifier(random_state=45)
Accuracy_score = 0.7995260663507109
Cross_Val_Score = 0.8061709960558645
roc_auc_score = 0.7122454944769152
classification_report
precision recall f1-score support
0 0.84 0.90 0.87 1549
1 0.65 0.53 0.58 561
accuracy 0.80 2110
macro avg 0.75 0.71 0.73 2110
weighted avg 0.79 0.80 0.79 2110
[[1392 157]
[ 266 295]]
AxesSubplot(0.125,0.808774;0.62x0.0712264)
---------------------> ExtraTreesClassifier <----------------------
ExtraTreesClassifier(random_state=45)
Accuracy_score = 0.776303317535545
Cross_Val_Score = 0.7817074550627181
roc_auc_score = 0.6810799676405571
classification_report
precision recall f1-score support
0 0.82 0.88 0.85 1549
1 0.60 0.48 0.53 561
accuracy 0.78 2110
macro avg 0.71 0.68 0.69 2110
weighted avg 0.76 0.78 0.77 2110
[[1370 179]
[ 293 268]]
AxesSubplot(0.125,0.808774;0.62x0.0712264)
---------------------> AdaBoostClassifier <----------------------
AdaBoostClassifier(random_state=45)
Accuracy_score = 0.7971563981042654
Cross_Val_Score = 0.8050307933531619
roc_auc_score = 0.714610886904207
classification_report
precision recall f1-score support
0 0.84 0.89 0.87 1549
1 0.64 0.54 0.59 561
accuracy 0.80 2110
macro avg 0.74 0.71 0.73 2110
weighted avg 0.79 0.80 0.79 2110
[[1380 169]
[ 259 302]]
- After training n number of models, we can finally conclude that the decision tree is working the best in this case.
- from above reports generated →
- Logistic_reg(Algo)
- 0.811662(Train_acc)
- 0.797156(Test_acc)
- 0.797156(Precision)
- 0.797156(Recall)
- 0.797156(F1_score)
- 0.807600(Avg_CV)
- Logistic Regresson gives us the best ‘Cross validation Score’
- also We have Logistic Regression as the winner beating Gradient Boost with a margin of 0.005%.
FINAL CONFUSION MATRIX
CLASSIFICATION REPORTS AS →
precision recall f1-score support
0 0.84 0.90 0.87 1549
1 0.65 0.52 0.58 561
accuracy 0.80 2110
macro avg 0.74 0.71 0.72 2110
weighted avg 0.79 0.80 0.79 2110
CONCLUDING REMARKS
in above model it is summarised that relation analysis between features and target with best visualisation
also best model has been predicted with best hyperparameter tuning and best score was derived and made 81% accuracy
from the over conclusions made →
people having very high tenure or very less tenure are leaving company
people don’t have the phone services aren’t enjoying other services , so probably customer is leaving , here company can work upon new schemes so that customer can get attract towards services.
there is no issue with monthly billing or method of payment , but maximum customers cant afford for two year subscription , company should come up with again new schemes and offers.
after model building, Logistic Regression algorithm looks best for the telecom customer churn dataset, which will predict the churn analysis.
Thank You
Author- SARANSH SHARDA