##### Key definitions

**AI: Artificial Intelligence**: a broad discipline with the goal of creating intelligent machines, as opposed to the natural intelligence that is demonstrated by humans and animals.

**AGI or ASI: Artificial General or Super Intelligence**: a term used to describe future machines that could match and then exceed the full range of human cognitive ability across all economically valuable tasks.

**ML: [[Machine Learning]]**: a subset of AI that often uses statistical techniques to give machines the ability to "learn" from data without being explicitly given the instructions for how to do so. This process is known as "training" a "model" using a learning "algorithm" that progressively improves model performance on a specific task.

**Algorithms** - set of templates of analytical logic used in machine learning.

**Model** - a specific implementation of an algorithm to solve a specific problem statement. More than one model can solve a problem with a given accuracy and confidence level.

**Training set** - data used to let the model identify patterns in the data and create rules.

**Validation set** - fresh set of data to check or validate how well the trained models work on an unbiased dataset (data not used for training).

**Test set** - another fresh set of data to determine the actual performance of the model.

##### Types of learning
- Depending on the data and problem statement, the type of machine learning can vary.

1. **Supervised learning** - Algorithm gets a set of labeled data in the training set.
    1. **Regression** - inferring the value of a dependent variable based on other pieces of data. E.g. time prediction - forecasting future values.
    2. **Classification** - identifying which category an entity belongs to out of a given set of categories.
2. **Unsupervised learning** - Algorithm infers patterns in data without the need for any labels. Data is unlabeled.
    1. **Clustering** - Algorithm finds which data points are similar to one another and groups them together.
    2. **Association** - categorize objects into buckets based on relationship. E.g. people who bought X also bought Y.
    3. **Anomaly detection** - identifying unexpected patterns in data that need to be flagged. E.g. cybersecurity malware threat identification.
3. **Semi-supervised learning** - Algorithm requires some labeled training data, but a lot less than in the case of supervised learning.
4. **Reinforcement learning** - Algorithm starts with a limited set of data and learns as it gets more feedback about its predictions over time to meet a goal. E.g. learning how to win a game of chess. [[Reinforcement learning - Ashwin Rao.pdf]]

**Features** - ML model data is organized into features (also called variables or attributes). These are relevant, independent pieces of data useful for prediction.

**Objective function** - the goal for which ML is optimizing or predicting. Depends on the business goal. E.g. an engagement business goal will dictate an objective function that calculates the probability of a user clicking on a product if they saw it. A direct revenue increase business goal will dictate an objective function that calculates the probability of a user purchasing the product if they saw it.

**Explainability and interpretability** - ??

**Modeling and measurement pitfalls**

1. **Overfitting** - Model is overfitted when it follows the data so closely that all the noise is also described by the model. A typical symptom is accuracy on the training set being significantly higher than on the test set.
2. **Precision vs Recall**
    1. Precision - % of positive predictions that are true positives (true positives out of all positive predictions).
    2. Recall - % of actual positives in the test data that the model predicted as positive (true positives out of all actual positives).
    3. Tradeoffs - Higher precision means fewer false positives; higher recall means fewer false negatives. Improving one usually costs some of the other, so where to set the balance is a business decision.
    4. Both precision and recall must be considered together to understand the true performance of the model.
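To make the precision and recall definitions concrete, here is a minimal sketch in plain Python. The function name `precision_recall` and the label lists are made up for illustration (1 marks the positive class); this is not a specific library's API.

```python
def precision_recall(actual, predicted):
    """Return (precision, recall) for binary labels where 1 is the positive class."""
    pairs = list(zip(actual, predicted))
    tp = sum(1 for a, p in pairs if a == 1 and p == 1)  # true positives
    fp = sum(1 for a, p in pairs if a == 0 and p == 1)  # false positives
    fn = sum(1 for a, p in pairs if a == 1 and p == 0)  # false negatives
    # Guard against division by zero when there are no positives at all.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Hypothetical test-set labels and model predictions.
actual    = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
predicted = [1, 1, 0, 0, 1, 0, 0, 0, 0, 0]

precision, recall = precision_recall(actual, predicted)
accuracy = sum(a == p for a, p in zip(actual, predicted)) / len(actual)

print(f"precision={precision:.2f} recall={recall:.2f} accuracy={accuracy:.2f}")
# prints: precision=0.67 recall=0.50 accuracy=0.70
```

In this toy data the model is right 70% of the time, yet it finds only half of the actual positives, which is why accuracy alone does not tell the whole story.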
##### Common Notations

- Dependent variables: aka target, output
- Independent variables: aka predictors, features or input
- Observation: single collection of input and output variables
- Dataset: multiple observations combined form a dataset
- Training dataset: used to train ML model
- Validation dataset: used to validate or compare ML models with different parameters
- Test dataset: used to evaluate a final ML model
- Quantitative variables:
    - Continuous
    - Near-continuous
- Categorical variables: discrete set of groups, e.g. nationality
- Modeling tasks are known as:
    - Regression: if target/output is quantitative
    - Classification: if target/output is categorical
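
As a rough illustration of how a dataset of observations might be divided into training, validation, and test datasets, here is a small sketch using only the Python standard library. The helper `split_dataset`, the 60/20/20 fractions, and the toy data are all assumptions chosen for the example, not a prescribed recipe.

```python
import random


def split_dataset(observations, train_frac=0.6, val_frac=0.2, seed=0):
    """Shuffle observations and split them into train, validation, and test lists."""
    shuffled = list(observations)
    random.Random(seed).shuffle(shuffled)  # fixed seed so the split is repeatable
    n_train = int(len(shuffled) * train_frac)
    n_val = int(len(shuffled) * val_frac)
    train = shuffled[:n_train]
    validation = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]  # whatever remains is held out for final evaluation
    return train, validation, test


# Each observation pairs the independent variables (input) with the dependent variable (target).
dataset = [((x,), 2 * x + 1) for x in range(10)]

train, validation, test = split_dataset(dataset)
print(len(train), len(validation), len(test))  # prints: 6 2 2
```

Shuffling before splitting is a common default so that each subset resembles the whole; for time-ordered data a chronological split is usually more appropriate.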