---
aliases: Churn prediction, regression
---

- Training data contains labeled examples, which are used to predict the label for new data
- Predicting car prices based on car attributes using historical sales data
- Split the observational data into a training set and a test set, build the model on the training data, then measure its accuracy on the test data
- Caveats: data sets need to be large enough, representative, and selected at random to avoid bias (data may contain outliers which may go unnoticed)
- K-fold cross-validation makes the evaluation more robust: the data is split into k folds and each fold serves once as the validation set
- Output is a probability score, which is converted into a class prediction based on a threshold (e.g., a probability above 0.75 means the customer will churn)
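
The split, cross-validation, and thresholding steps above can be sketched in plain Python. This is a minimal illustration, not a reference implementation: the helper names (`train_test_split`, `k_fold_indices`, `to_class`), the 20% test fraction, and k = 5 are assumptions made for the example; only the 0.75 threshold comes from the notes.

```python
import random

def train_test_split(rows, test_fraction=0.2, seed=0):
    """Shuffle a copy of rows, then return (train, test)."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)  # random selection to avoid ordering bias
    n_test = int(round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]

def k_fold_indices(n_samples, k=5, seed=0):
    """Yield (train_idx, val_idx); every sample lands in exactly one validation fold."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]  # k roughly equal folds
    for i in range(k):
        val_idx = folds[i]
        train_idx = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train_idx, val_idx

def to_class(probabilities, threshold=0.75):
    """Map probability scores to labels: 1 (churn) if score > threshold, else 0."""
    return [1 if p > threshold else 0 for p in probabilities]

train, test = train_test_split(list(range(10)))
print(len(train), len(test))        # 8 2
print(to_class([0.2, 0.8, 0.76]))   # [0, 1, 1]
```

In practice a library such as scikit-learn would normally provide the splitting and cross-validation; the hand-written version only makes the mechanics visible.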

[Weighted Logistic Regression for Imbalanced Dataset | by Dinesh Yadav | Towards Data Science](https://towardsdatascience.com/weighted-logistic-regression-for-imbalanced-dataset-9a5cd88e68b)
[Must-know Machine Learning Questions – Logistic Regression (upgrad.com)](https://www.upgrad.com/blog/machine-learning-interview-questions-answers-logistic-regression/#1_What_is_a_logistic_function_What_is_the_range_of_values_of_a_logistic_function)
[ML Systems Design Interview Guide · Patrick Halina](http://patrickhalina.com/posts/ml-systems-design-interview-guide/#data-brainstorming-and-feature-engineering)
[An overview of correlation measures between categorical and continuous variables | by Outside Two Standard Deviations | Medium](https://medium.com/@outside2SDs/an-overview-of-correlation-measures-between-categorical-and-continuous-variables-4c7f85610365)

### Porter's churn prediction setup:
Problem statement: Customer drop-off is significantly high for order numbers 2-6. 
Objective: Reduce churn in this category. Increase customer retention and LTV. 

Performed user path analysis to understand user journey for churned vs retained customers
Data:
- Order data: type, transaction value
- Order event log: stockouts, call flags from driver app, delta between estimated vs actual time and fare, time to place order, time to assign driver
- Order metrics on supply side and service levels 
- Customer support history 
- App login data for orders 1-6 
- Customer data: age, device type
- Modeling steps:
	- Split the data into training and test sets
	- Test on random samples to improve prediction accuracy 
	- Improve input variables 
	- Test on retained customers to reduce false positives 
	- Test on churned customers to reduce false negatives 
	- Class weight balancing, to reduce false negatives (this mainly improves recall on the under-represented class, not necessarily overall accuracy) 
	- Maximize the area under the ROC (receiver operating characteristic) curve (AUC) 
	- Maximize recall = TP / (TP + FN), and measure sensitivity and specificity (sensitivity is the same quantity as recall): 
		- Sensitivity = (number correctly identified 1s)/(total number observed 1s) 
		- Specificity = (number correctly identified 0s)/(total number observed 0s)
	- Measure results using a chi-squared test to check whether the predicted outcome is associated with the actual outcome. A p-value below 0.05 is taken as evidence of a relation between churn/no churn and the input features.
	- Confusion matrix is used to evaluate false positives and false negatives. 
	- "Recall" is prioritized to ensure we identify true churners (true positives)
		- Recall = TP / (TP + FN)
	- "Precision" is used to reduce marketing spend wastage by minimizing false positives 
		- Precision = TP / (TP + FP)
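
The evaluation steps above (confusion matrix, recall, precision, specificity, chi-squared test) can be sketched as follows. This is an illustrative example under stated assumptions: the label lists are made-up toy data, not Porter's results; the function names are hypothetical; and the chi-squared test is the plain Pearson test of independence on the 2x2 predicted-vs-actual table, without continuity correction.

```python
import math

def confusion_counts(actual, predicted):
    """Return (tp, fp, tn, fn) for 0/1 labels, where 1 = churn."""
    tp = fp = tn = fn = 0
    for a, p in zip(actual, predicted):
        if a == 1 and p == 1:
            tp += 1
        elif a == 0 and p == 1:
            fp += 1
        elif a == 0 and p == 0:
            tn += 1
        else:
            fn += 1
    return tp, fp, tn, fn

def recall(tp, fn):
    """Sensitivity: share of actual churners that were caught."""
    return tp / (tp + fn) if tp + fn else 0.0

def precision(tp, fp):
    """Share of predicted churners that really churned."""
    return tp / (tp + fp) if tp + fp else 0.0

def specificity(tn, fp):
    """Share of retained customers correctly identified."""
    return tn / (tn + fp) if tn + fp else 0.0

def chi_squared_2x2(tp, fp, tn, fn):
    """Pearson chi-squared statistic and p-value (1 degree of freedom)."""
    n = tp + fp + tn + fn
    observed = ((tp, fn), (fp, tn))   # rows: actual 1, actual 0
    row_totals = (tp + fn, fp + tn)
    col_totals = (tp + fp, fn + tn)   # columns: predicted 1, predicted 0
    stat = 0.0
    for i in range(2):
        for j in range(2):
            expected = row_totals[i] * col_totals[j] / n
            if expected == 0:
                return 0.0, 1.0       # degenerate table: no evidence either way
            stat += (observed[i][j] - expected) ** 2 / expected
    # For 1 degree of freedom the upper-tail probability is erfc(sqrt(stat / 2)).
    return stat, math.erfc(math.sqrt(stat / 2))

# Toy labels, invented for illustration only.
actual    = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
predicted = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]
tp, fp, tn, fn = confusion_counts(actual, predicted)
print(recall(tp, fn), precision(tp, fp))  # 0.75 0.75
stat, p_value = chi_squared_2x2(tp, fp, tn, fn)
print(p_value < 0.05)                     # False for this small toy sample
```

With only ten toy observations the test does not reach the 0.05 level even though most predictions are correct, which shows why the sample needs to be large enough before the p-value is informative.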