---
aliases: Churn prediction, regression
---

- The data contains labeled values, which are used to predict the answer for new data
- Example: predicting car prices from car attributes using historical sales data
- Split the observational data into a training set and a test set, build the model on the training data, then measure the model's accuracy on the test data
- Caveats: data sets need to be large enough, representative, and selected at random to avoid bias (the data may also contain outliers that go unnoticed)
- K-fold cross-validation is used to make training and evaluation less dependent on any single train/test split
- Output is a probability score, which is converted into a class prediction based on a threshold (e.g., a probability above 0.75 means the customer is predicted to churn)

- [Weighted Logistic Regression for Imbalanced Dataset | by Dinesh Yadav | Towards Data Science](https://towardsdatascience.com/weighted-logistic-regression-for-imbalanced-dataset-9a5cd88e68b)
- [Must-know Machine Learning Questions – Logistic Regression (upgrad.com)](https://www.upgrad.com/blog/machine-learning-interview-questions-answers-logistic-regression/#1_What_is_a_logistic_function_What_is_the_range_of_values_of_a_logistic_function)
- [ML Systems Design Interview Guide · Patrick Halina](http://patrickhalina.com/posts/ml-systems-design-interview-guide/#data-brainstorming-and-feature-engineering)
- [An overview of correlation measures between categorical and continuous variables | by Outside Two Standard Deviations | Medium](https://medium.com/@outside2SDs/an-overview-of-correlation-measures-between-categorical-and-continuous-variables-4c7f85610365)

### Porter's churn prediction setup:

Problem statement: Customer drop-off is significantly high for order numbers 2-6.

Objective: Reduce churn in this category. Increase customer retention and LTV.

Performed user path analysis to understand the user journey for churned vs. retained customers.

Data:

- Order data: type, transaction value
- Order event log: stockouts, call flags from driver app, delta between estimated vs. actual time and fare, time to place order, time to assign driver
- Order metrics on supply side and service levels
- Customer support history
- App login data for orders 1-6
- Customer data: age, device type

Modeling steps:

- Split into test data and training data
- Test on random samples to improve prediction accuracy
- Improve input variables
- Test on retained customers to reduce false positives
- Test on churned customers to reduce false negatives
- Class weight balancing to reduce false negatives and increase accuracy
- Maximize the area under the ROC (receiver operating characteristic) curve
- Maximize the recall ratio = TP / (TP + FN)

Measuring specificity and sensitivity (sensitivity is the same quantity as recall). Mathematically these are represented as:

- Sensitivity = (number of correctly identified 1s) / (total number of observed 1s)
- Specificity = (number of correctly identified 0s) / (total number of observed 0s)
- Measure results using a chi-squared test to understand whether the predicted outcome matches the actual outcome. The p-value should be less than 0.05 to establish a relation between churn/no churn and the input features.
- A confusion matrix is used to evaluate false positives and false negatives.
- "Recall" is prioritized to ensure we identify true churners (true positives)
    - Recall = TP / (TP + FN)
- "Precision" is used to reduce marketing spend wastage by minimizing false positives
    - Precision = TP / (TP + FP)
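
The threshold step and the confusion-matrix metrics above can be sketched in a few lines of plain Python. This is only an illustration, not the pipeline Porter used: the function names (`classify`, `confusion_counts`, `evaluate`) and the sample labels and scores are made up for the example, and churn is assumed to be encoded as 1.

```python
def classify(probabilities, threshold=0.75):
    """Convert churn probabilities into 0/1 predictions using a cutoff."""
    return [1 if p > threshold else 0 for p in probabilities]


def confusion_counts(actual, predicted):
    """Return (tp, fp, tn, fn), treating 1 (churn) as the positive class."""
    pairs = list(zip(actual, predicted))
    tp = sum(1 for a, p in pairs if a == 1 and p == 1)
    fp = sum(1 for a, p in pairs if a == 0 and p == 1)
    tn = sum(1 for a, p in pairs if a == 0 and p == 0)
    fn = sum(1 for a, p in pairs if a == 1 and p == 0)
    return tp, fp, tn, fn


def ratio(numerator, denominator):
    """Divide safely; return 0.0 when the denominator is zero."""
    return numerator / denominator if denominator else 0.0


def evaluate(actual, probabilities, threshold=0.75):
    """Compute recall (sensitivity), precision and specificity at a threshold."""
    predicted = classify(probabilities, threshold)
    tp, fp, tn, fn = confusion_counts(actual, predicted)
    return {
        "recall": ratio(tp, tp + fn),       # TP / (TP + FN), same as sensitivity
        "precision": ratio(tp, tp + fp),    # TP / (TP + FP)
        "specificity": ratio(tn, tn + fp),  # TN / (TN + FP)
    }


# Made-up labels (1 = churned) and model scores, for illustration only.
actual = [1, 1, 1, 0, 0, 0, 1, 0]
scores = [0.9, 0.8, 0.6, 0.7, 0.2, 0.1, 0.4, 0.85]

strict = evaluate(actual, scores, threshold=0.75)
lenient = evaluate(actual, scores, threshold=0.5)
print("threshold 0.75:", strict)
print("threshold 0.50:", lenient)
```

On this toy sample, lowering the threshold from 0.75 to 0.5 catches more churners (higher recall) but flags more retained customers by mistake (lower precision and specificity). That is the trade-off behind prioritizing recall while watching precision.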