Customer Churn Analysis

Saransh Sharda
11 min readMay 31, 2021

Problem Statement:

Domain Topic Telecom Churn Analysis Telecom Churn (loss of customers to competition) is a problem for telecom companies because it is expensive to acquire a new customer and companies want to retain their existing customers. Most telecom companies suffer from voluntary churn.

Customer churn is when a company’s customers stop doing business with that company. Businesses are very keen on measuring churn because keeping an existing customer is far less expensive than acquiring a new customer. New business involves working leads through a sales funnel, using marketing and sales budgets to gain additional customers. Existing customers will often have a higher volume of service consumption and can generate additional customer referrals.

Customer retention can be achieved with good customer service and products. But the most effective way for a company to prevent attrition of customers is to truly know them. The vast volumes of data collected about customers can be used to build churn prediction models. Knowing who is most likely to defect means that a company can priorities focused marketing efforts on that subset of their customer base.

Preventing customer churn is critically important to the telecommunications sector, as the barriers to entry for switching services are so low.

I will examine customer data from IBM Sample Data Sets with the aim of building and comparing several customer churn prediction models.

DATA ANALYSIS/EDA

Data comprises of Services — which services the customer subscribed to (internet, phone, cable, etc.), Monthly bill total, Tenure — How long they had been a customer, Basic demographic info — whether they were elderly, had dependents, etc.

let us do the data analysis one by one

Gender

As the gender indicate , that how much mostly , in telecom industry the customer has the insured sex count and from my analysis

Male      3555
Female 3488
countmale = len(df[df.gender == 'Male'])
countfemale = len(df[df.gender == 'Female'])
print("Percentage of male: {:.2f}%".format((countmale / (len(df.gender))*100)))
print("Percentage of female: {:.2f}%".format((countfemale / (len(df.gender))*100)))
Percentage of male: 50.48%
Percentage of female: 49.52%

it clearly shows that telecom industry does not depend upon gender and has almost equal ratio between male and female, only uniqueness in dataset

SENIOR-CITIZEN

it can be a possible exploratory analysis that senior citizen , if they are in count, can possible can leave doing business with the company. Businesses are very keen on measuring churn because keeping an existing customer is far less expensive than acquiring a new customer but as a senior citizen cannot that be possibly true for keeping customer for a long time

0    5901
1 1142
count0 = len(df[df.SeniorCitizen == 0])count1 = len(df[df.SeniorCitizen == 1])print("Percentage of non senior citizen: {:.2f}%".format((count0 / (len(df.SeniorCitizen))*100)))print("Percentage of senior citizen: {:.2f}%".format((count1 / (len(df.SeniorCitizen))*100)))Percentage of non senior citizen: 83.79%
Percentage of senior citizen: 16.21%

maximum people are not senior citizen and only 16.2% people from data are senior citizen but still 16% have possibility of leaving company or not using much of services.

PARTNER

Partner Interconnect provides connectivity between your on-premises network and your Virtual Private Cloud (VPC) network through a supported service provider.

No     3641
Yes 3402
countno = len(df[df.Partner == 'No'])countyes = len(df[df.Partner == 'Yes'])print("Percentage of no: {:.2f}%".format((countno / (len(df.Partner))*100)))Print("Percentage of yes: {:.2f}%".format((countyes / (len(df.Partner))*100)))Percentage of no: 51.70%
Percentage of yes: 48.30%

There is are balanced ratio between the partnered and indivudal connections customers. It means churn is independent of connection type

DEPENDENTS

It surely means whether the customer is likely to be dependent or not, Customer without dependents are four times more likely to churn. Senior citizens are three times less likely to churn

No     4933
Yes 2110
countno = len(df[df.Dependents == 'No'])countyes = len(df[df.Dependents == 'Yes'])print("Percentage of no: {:.2f}%".format((countno / (len(df.Dependents))*100)))print("Percentage of yes: {:.2f}%".format((countyes / (len(df.Dependents))*100)))Percentage of no: 70.04%
Percentage of yes: 29.96%

Most of the customers are independent. and almost 30 percent are dependent upon others, as possible explanation can be , that maximum people earn by themselves and pay bills, can only count students and senior citizen in this 30 percent

TENURE

Tenure refers to the number of months that a customer has subscribed for. The tenure for a churning customer indicates the number of months that they spent as a subscriber.

as exploratory description of tenure is

count    7043.000000
mean 32.371149
std 24.559481
min 0.000000
25% 9.000000
50% 29.000000
75% 55.000000
max 72.000000

so on an average, tenure of a customer can be taken as 32 which is good , so even after this much tenure , if a customer is leaving, may be there can be lot of other services that encouraging customer to leave.

TENURE/CHURN (BI-vairiate analysis)

From he analysis, the customers with less tenures mostly stop the services.

The regular customers are distributed uniformly, have normal distribution

PHONE SERVICE

A phone service is a public utility service focused on voice communications delivered by a phone company to residential and commercial clients.

Yes    6361
No 682
countno = len(df[df.PhoneService == 'No'])countyes = len(df[df.PhoneService == 'Yes'])print("Percentage of no: {:.2f}%".format((countno / (len(df.PhoneService))*100)))print("Percentage of yes: {:.2f}%".format((countyes / (len(df.PhoneService))*100)))Percentage of no: 9.68%
Percentage of yes: 90.32%

maximum people have almost 90 % have phone service and 10 percent doesn’t have this service. Out of 7000, 700 don’t have phone services, might be using other services only for once or twice.

MULTIPLE LINES

A Multiline is a group of analogue telephone lines with one telephone number. The first line is known as the main exchange line, whilst the extra lines are known as secondary lines. The additional lines are added to the main exchange telephone line so there’s no need for a new number.

No                  3390
Yes 2971
No phone service 682
countno = len(df[df.MultipleLines == 'No'])countyes = len(df[df.MultipleLines == 'Yes'])
print("Percentage of no: {:.2f}%".format((countno / (len(df.MultipleLines))*100)))print("Percentage of yes: {:.2f}%".format((countyes / (len(df.MultipleLines))*100)))countnoservice = len(df[df.MultipleLines == 'No phone service'])print("Percentage of noservice: {:.2f}%".format((countnoservice / (len(df.MultipleLines))*100)))Percentage of no: 48.13%
Percentage of yes: 42.18%
Percentage of noservice: 9.68%

other than the people with no phone service at all, Customers equally use the multiline and single line phone services.

INTERNET SERVICES

Telecom service providers offer voice calling and text messaging directly, while internet services run on the top layer of the network as standalone services. The services are offered in a different context, even if they sometimes serve the same function for end

Fiber optic    3096
DSL 2421
No 1526
countfibre = len(df[df.InternetService == 'Fibre optic'])countDSL = len(df[df.InternetService == 'DSL'])print("Percentage of fibre: {:.2f}%".format((countno / (len(df.InternetService))*100)))print("Percentage of DSl: {:.2f}%".format((countyes / (len(df.InternetService))*100)))countno = len(df[df.InternetService == 'No'])print("Percentage of no: {:.2f}%".format((countno / (len(df.InternetService))*100)))Percentage of fibre: 48.13%
Percentage of DSl: 42.18%
Percentage of no: 21.67%

Most of the customers have opted for Fibre Optic as it is the fasted and highest bandwidth connumication meduim.

Fibre optic being able to have a very high speed, the cost for the fibre optics is high, thus many of the custmers also prefer DSL i.e., Digital Subscriber Line.

Yet 21 percent of the customers prefer not have any internet connection.

ONLINE SECURITY

Online security refers to the body of technologies, processes, and practices designed to protect networks, devices, programs, and data from attack, damage, or unauthorized access. Online security may also be referred to as information technology security.

No                     3498
Yes 2019
No internet service 1526
countno = len(df[df.OnlineSecurity == 'No'])countyes = len(df[df.OnlineSecurity == 'Yes'])print("Percentage of no: {:.2f}%".format((countno / (len(df.OnlineSecurity))*100)))print("Percentage of yes: {:.2f}%".format((countyes / (len(df.OnlineSecurity))*100)))countnoservice = len(df[df.OnlineSecurity == 'No internet service'])print("Percentage of noservice: {:.2f}%".format((countnoservice / (len(df.OnlineSecurity))*100)))Percentage of no: 49.67%
Percentage of yes: 28.67%
Percentage of no service: 21.67%

only 21% people have online security, others dont have.

ONLINE BACKUP

A remote backup, online backup or cloud backup is where a copy of computer files or systems are stored in a secure offsite location. In the event of an IT failure where the original data is lost, the remotely stored data can be used to return operations to normal running.

countno = len(df[df.OnlineBackup == 'No'])countyes = len(df[df.OnlineBackup ==  'Yes'])print("Percentage of no: {:.2f}%".format((countno / (len(df.OnlineBackup))*100)))print("Percentage of yes: {:.2f}%".format((countyes / (len(df.OnlineBackup))*100)))countnoservice = len(df[df.OnlineBackup ==  'No internet service'])print("Percentage of noservice: {:.2f}%".format((countnoservice / (len(df.OnlineBackup))*100)))Percentage of no: 43.84%
Percentage of yes: 34.49%
Percentage of noservice: 21.67%

only 34% have online backup options with them, rest don’t have it.

‘OnlineSecurity’, ‘OnlineBackup’, ‘DeviceProtection’, ‘TechSupport’,’StreamingTV’, ‘StreamingMovies’

There above are the services on can avail only if they had availed the internet service.

the above services have almost same percentage as above two shown, shows that if the customer has internet services available, the only he can avail other services like ‘OnlineSecurity’, ‘OnlineBackup’, ‘DeviceProtection’, ‘TechSupport’,’StreamingTV’, ‘StreamingMovies’

ALL SERVICES ABOVE/MONTHLY CHARGES

From the analysis, its shown that the more services customer has, the more will be monthly be added, that is The monthly charges increase as the number of sub products increase

CONTRACT

If you are a telecom business or if you are contracting for telecommunication services, your contract is likely to be highly technical and possibly very detailed.

Month-to-month    3875
Two year 1695
One year 1473

As you can see from above count, Customer mostly prefer the month to month contract, possible options as for customer it is difficult to pay all amount all of a sudden for year or two.

PAYMENT METHODS

A customer can make payment using different payment methods that are supported by the service provider; for example, the customer can make payments using the payment methods such as cheque, credit card, debit card or wire transfers, or direct cash deposit

from analysis, electronic cheque is preferred by customers as they feel safe to pay for the services

PAYMENT METHOD VS TENURE

AS people with high tenure with company , customers do prefer automatic payments more.

MONTHLY CHARGES AND TOTAL CHARGES

as per analysis goes by, monthly charges for a customer from box plot , shows between 40–80/- and total charges aand maximum between 0–2000/- and then goes down with count.

INTERNET SERVICES /MONTHLY CHARGES

AS most people prefer month to month charges but monthly wise Fibre optics are costly than DSL

MULTIPLE LINES/MONTHLY CHARGES

AS most people prefer month to month charges but multiples lines make monthly charge costly

TOTAL CHARGES/CHURN

from the analysis, total charges/monthly charges are almost vice versa and uniformly distributed to other features

CHURN

Churn rate is the percentage of subscribers to a service that discontinue their subscription to that service in a given time period. … Churn rate is an important consideration in the telephone and cell phone services industry.

No     5163
Yes 1869
countno = len(df[df.Churn == 'No'])countyes = len(df[df.Churn == 'Yes'])print("Percentage of no: {:.2f}%".format((countno / (len(df.Churn))*100)))print("Percentage of yes: {:.2f}%".format((countyes / (len(df.Churn))*100)))Percentage of no: 73.42%
Percentage of yes: 26.58%

so company is loosing almost 26% customers.

PRE PROCESSING

FROM CORRELATION

I can conclude that To avoid unstable estimates of coeffiecients in the model, need to drop the ‘Total Charges’ variable during regression process, as it is highly correlated to both ‘Tenure’ and ‘Monthly Charges’.

It can be seen from correlation plot that, Contract_Month-to-month, Online security_No, Tech Support_No…etc. are positively correlated with Churn. While, on the other end of the plot, tenure, Contract_Two year, InternetService_No…etc. are negatively correlated with Churn.

Interestingly, services such as Online security, Streaming TV, OnlineBackup, TechSupport…, etc. with InternetService_No seem to be negatively related to Churn. Need to explore the patterns more for the above correlations below before modelling and identifying the important variables.

NULL VALUES AND REDUNDANT COLUMNS

so over all we have almost 7043records and 21 columns

no null values and got rid of redundant columns as customer ID.

OUTLIERS

From box plot, there were no outliers as well.

SKEW STATS

As data was positively skewed, so square root tranformation worked in a positive way for balancing.

LABEL ENCODING

as maximum data was categorical so from Label encoder, dataset been encoded and for object data types as data was mostly categorical, dummies were created.

from sklearn.preprocessing import LabelEncoder
le=LabelEncoder()
df['Churn']= le.fit_transform(df['Churn'])
df=pd.get_dummies(df,columns=df.columns[df.dtypes=="object"],drop_first=True)

SCALING

As dataset need to get scaled to get maximum accuracy, standard scaler was taken help to get out data set properly scaled before building model

MODEL BUILDING

from train test split , we got the shapes of dataset as

((4922, 30), (4922,), (2110, 30), (2110,))

In [476]:

ALGORITHMS

KNN=KNeighborsClassifier()
SV=SVC()
LR=LogisticRegression()
DT=DecisionTreeClassifier(random_state=60)
GNB=GaussianNB()
RFC=RandomForestClassifier(n_estimators=200,random_state=60)
GBC=GradientBoostingClassifier(n_estimators=200,random_state=60)
ABC=AdaBoostClassifier(random_state=60)
ETC=ExtraTreesClassifier(random_state=60)

FINAL CONCLUSION

KNeighborsClassifier()


Accuracy_score = 0.7644549763033175


Cross_Val_Score = 0.7677746750937541


roc_auc_score = 0.6912014996737588


classification_report
precision recall f1-score support

0 0.83 0.85 0.84 1549
1 0.56 0.53 0.55 561

accuracy 0.76 2110
macro avg 0.70 0.69 0.69 2110
weighted avg 0.76 0.76 0.76 2110



[[1313 236]
[ 261 300]]


AxesSubplot(0.125,0.808774;0.62x0.0712264)



---------------------> SVC <----------------------


SVC()


Accuracy_score = 0.7824644549763033


Cross_Val_Score = 0.8000567777705936


roc_auc_score = 0.6517360979252902


classification_report
precision recall f1-score support

0 0.80 0.93 0.86 1549
1 0.66 0.37 0.48 561

accuracy 0.78 2110
macro avg 0.73 0.65 0.67 2110
weighted avg 0.77 0.78 0.76 2110



[[1442 107]
[ 352 209]]


AxesSubplot(0.125,0.808774;0.62x0.0712264)



---------------------> LogisticRegression <----------------------


LogisticRegression()


Accuracy_score = 0.7971563981042654


Cross_Val_Score = 0.8081628653174706


roc_auc_score = 0.7089261198933473


classification_report
precision recall f1-score support

0 0.84 0.90 0.87 1549
1 0.65 0.52 0.58 561

accuracy 0.80 2110
macro avg 0.74 0.71 0.72 2110
weighted avg 0.79 0.80 0.79 2110



[[1390 159]
[ 269 292]]


AxesSubplot(0.125,0.808774;0.62x0.0712264)



---------------------> DecisionTreeClassifier <----------------------


DecisionTreeClassifier(random_state=45)


Accuracy_score = 0.7393364928909952


Cross_Val_Score = 0.7272458538083539


roc_auc_score = 0.6712513046770442


classification_report
precision recall f1-score support

0 0.83 0.82 0.82 1549
1 0.51 0.53 0.52 561

accuracy 0.74 2110
macro avg 0.67 0.67 0.67 2110
weighted avg 0.74 0.74 0.74 2110



[[1265 284]
[ 266 295]]


AxesSubplot(0.125,0.808774;0.62x0.0712264)



---------------------> GaussianNB <----------------------


GaussianNB()


Accuracy_score = 0.6526066350710901


Cross_Val_Score = 0.6626806381740591


roc_auc_score = 0.7218969400072959


classification_report
precision recall f1-score support

0 0.92 0.57 0.71 1549
1 0.43 0.87 0.57 561

accuracy 0.65 2110
macro avg 0.67 0.72 0.64 2110
weighted avg 0.79 0.65 0.67 2110



[[889 660]
[ 73 488]]


AxesSubplot(0.125,0.808774;0.62x0.0712264)



---------------------> RandomForestClassifier <----------------------


RandomForestClassifier(random_state=45)


Accuracy_score = 0.7914691943127962


Cross_Val_Score = 0.7916647808095176


roc_auc_score = 0.6982309327275719


classification_report
precision recall f1-score support

0 0.83 0.90 0.86 1549
1 0.64 0.50 0.56 561

accuracy 0.79 2110
macro avg 0.73 0.70 0.71 2110
weighted avg 0.78 0.79 0.78 2110



[[1390 159]
[ 281 280]]


AxesSubplot(0.125,0.808774;0.62x0.0712264)



---------------------> GradientBoostingClassifier <----------------------


GradientBoostingClassifier(random_state=45)


Accuracy_score = 0.7995260663507109


Cross_Val_Score = 0.8061709960558645


roc_auc_score = 0.7122454944769152


classification_report
precision recall f1-score support

0 0.84 0.90 0.87 1549
1 0.65 0.53 0.58 561

accuracy 0.80 2110
macro avg 0.75 0.71 0.73 2110
weighted avg 0.79 0.80 0.79 2110



[[1392 157]
[ 266 295]]


AxesSubplot(0.125,0.808774;0.62x0.0712264)



---------------------> ExtraTreesClassifier <----------------------


ExtraTreesClassifier(random_state=45)


Accuracy_score = 0.776303317535545


Cross_Val_Score = 0.7817074550627181


roc_auc_score = 0.6810799676405571


classification_report
precision recall f1-score support

0 0.82 0.88 0.85 1549
1 0.60 0.48 0.53 561

accuracy 0.78 2110
macro avg 0.71 0.68 0.69 2110
weighted avg 0.76 0.78 0.77 2110



[[1370 179]
[ 293 268]]


AxesSubplot(0.125,0.808774;0.62x0.0712264)



---------------------> AdaBoostClassifier <----------------------


AdaBoostClassifier(random_state=45)


Accuracy_score = 0.7971563981042654


Cross_Val_Score = 0.8050307933531619


roc_auc_score = 0.714610886904207


classification_report
precision recall f1-score support

0 0.84 0.89 0.87 1549
1 0.64 0.54 0.59 561

accuracy 0.80 2110
macro avg 0.74 0.71 0.73 2110
weighted avg 0.79 0.80 0.79 2110



[[1380 169]
[ 259 302]]
  • After training n number of models, we can finally conclude that the decision tree is working the best in this case.
  • from above reports generated →
  • Logistic_reg(Algo)
  • 0.811662(Train_acc)
  • 0.797156(Test_acc)
  • 0.797156(Precision)
  • 0.797156(Recall)
  • 0.797156(F1_score)
  • 0.807600(Avg_CV)
  • Logistic Regresson gives us the best ‘Cross validation Score’
  • also We have Logistic Regression as the winner beating Gradient Boost with a margin of 0.005%.

FINAL CONFUSION MATRIX

CLASSIFICATION REPORTS AS →

precision    recall  f1-score   support

0 0.84 0.90 0.87 1549
1 0.65 0.52 0.58 561

accuracy 0.80 2110
macro avg 0.74 0.71 0.72 2110
weighted avg 0.79 0.80 0.79 2110

CONCLUDING REMARKS

in above model it is summarised that relation analysis between features and target with best visualisation

also best model has been predicted with best hyperparameter tuning and best score was derived and made 81% accuracy

from the over conclusions made →

people having very high tenure or very less tenure are leaving company

people don’t have the phone services aren’t enjoying other services , so probably customer is leaving , here company can work upon new schemes so that customer can get attract towards services.

there is no issue with monthly billing or method of payment , but maximum customers cant afford for two year subscription , company should come up with again new schemes and offers.

after model building, Logistic Regression algorithm looks best for the telecom customer churn dataset, which will predict the churn analysis.

Thank You

Author- SARANSH SHARDA

--

--

Saransh Sharda
0 Followers

I am working as a data scientist, currently working in Chandigarh.