- More data beats cleverer algorithms.
- All learners essentially work by grouping nearby examples into the same class; the key difference is in the meaning of 'nearby'.
- Learn many models, not just one.
- Ensemble models; bagging and boosting.
- The generalization error of a boosted ensemble continues to improve by adding classifiers even after the training error has reached zero.
- Contrary to intuition, there is no necessary connection between the number of parameters of a model and its tendency to overfit.
- Just because a function can be represented does not mean it can be learned.
- Correlation does not imply causation. On the other hand, correlation is a sign of a potential causal connection, and we can use it as a guide for further investigation.

- Local outlier factor: finding anomalous data points by measuring the local deviation of a given data point with respect to its neighbours.
- Wiki: Local Outlier factor
- sklearn.neighbors.LocalOutlierFactor
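A minimal sketch of the LOF bullet above, using `sklearn.neighbors.LocalOutlierFactor` on made-up data (the cluster and the two injected outliers are illustrative, not from the source):

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

rng = np.random.RandomState(42)
# 100 inliers clustered near the origin, plus 2 obvious outliers.
X_inliers = 0.3 * rng.randn(100, 2)
X_outliers = np.array([[4.0, 4.0], [-4.0, 3.5]])
X = np.vstack([X_inliers, X_outliers])

# fit_predict returns +1 for inliers and -1 for outliers, based on how much
# each point's local density deviates from that of its neighbours.
lof = LocalOutlierFactor(n_neighbors=20)
labels = lof.fit_predict(X)

print(labels[-2:])  # the injected points should be flagged as -1
print(lof.negative_outlier_factor_[-2:])  # far more negative than inliers'
```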

Doing simple cross-validation and k-fold cross-validation for knn on a toy dataset.

Sample python code (Jupyter notebook).

Sample data (3.concertriccir2.csv).
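A sketch of both validation schemes on synthetic concentric-circles data (`make_circles` stands in for the CSV mentioned above; the sample sizes and noise level are assumptions):

```python
from sklearn.datasets import make_circles
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Toy concentric-circles data in place of the sample CSV.
X, y = make_circles(n_samples=400, noise=0.05, factor=0.5, random_state=0)

# Simple (hold-out) cross-validation: a single train/test split.
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)
knn = KNeighborsClassifier(n_neighbors=5).fit(X_tr, y_tr)
holdout_acc = knn.score(X_te, y_te)

# 10-fold cross-validation: average accuracy over 10 train/validate splits.
cv_scores = cross_val_score(KNeighborsClassifier(n_neighbors=5), X, y, cv=10)
print(holdout_acc, cv_scores.mean())
```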

- Wiki: Confusion Matrix
- Helps you visualize the performance of an algorithm
- Confusion matrix and conditional probabilities
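The link between the confusion matrix and conditional probabilities can be made concrete with a small hand-computed example (the labels below are made up for illustration):

```python
# Hypothetical binary predictions vs. ground truth.
actual    = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
predicted = [1, 1, 1, 0, 0, 0, 0, 0, 1, 0]

tp = sum(a == 1 and p == 1 for a, p in zip(actual, predicted))
fn = sum(a == 1 and p == 0 for a, p in zip(actual, predicted))
fp = sum(a == 0 and p == 1 for a, p in zip(actual, predicted))
tn = sum(a == 0 and p == 0 for a, p in zip(actual, predicted))

# Normalizing rows of the confusion matrix gives conditional probabilities:
tpr = tp / (tp + fn)  # P(predicted=1 | actual=1), sensitivity / recall
tnr = tn / (tn + fp)  # P(predicted=0 | actual=0), specificity
print([[tn, fp], [fn, tp]], tpr, tnr)
```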

- Using cross validation with knn.
- Cross validation - wiki
- How to select the hyper-parameter 'k' for knn.
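One common way to pick 'k' is a cross-validated grid search; a sketch on synthetic data (the dataset and candidate grid are assumptions, not from the source):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=500, n_features=5, random_state=0)

# Try odd values of k with 5-fold cross-validation; odd k avoids voting ties.
param_grid = {"n_neighbors": [1, 3, 5, 7, 9, 11, 13, 15]}
search = GridSearchCV(KNeighborsClassifier(), param_grid, cv=5)
search.fit(X, y)

print(search.best_params_, search.best_score_)
```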

- Covariate Shift
- Covariate shift: Unearthing hidden problems in Real World Data Science
- Covariate shift: ML Blog
- Covariate shift: StackExchange
- Pykliep: A density ratio estimator package for python using the KLIEP algorithm
- KLIEP: Kullback-Leibler Importance Estimation Procedure
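Before reaching for KLIEP, a simpler diagnostic (sometimes called "adversarial validation", not part of the links above) is to train a classifier to distinguish train rows from test rows; a sketch on deliberately shifted synthetic data:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.RandomState(0)
# Train and test inputs drawn from deliberately different distributions.
X_train = rng.normal(loc=0.0, size=(300, 2))
X_test = rng.normal(loc=1.5, size=(300, 2))

# Label each row by its origin and try to tell the two sets apart.
X_all = np.vstack([X_train, X_test])
origin = np.array([0] * len(X_train) + [1] * len(X_test))
acc = cross_val_score(LogisticRegression(), X_all, origin, cv=5).mean()

# Accuracy near 0.5 => similar distributions; well above 0.5 => covariate shift.
print(acc)
```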

- Bias-Variance tradeoff
- Wiki: Bias-Variance tradeoff
- To check for high bias: if the train error is high, the model is not fitting the training data properly, and it may be a case of high bias.
- To check for high variance: if the train error is low but the test (or cross-validation) error is high, the model fits the training points properly but cannot generalize to unseen points; it may be a case of high variance. Another check: if you change the training data slightly, does the model error change? If it changes a lot, it may be because of high variance.
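The train-vs-test-error diagnostic above can be sketched with knn on synthetic data (the dataset and the two k values are illustrative assumptions): k=1 memorizes the training set, so a large train/test gap is the high-variance signature.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=400, n_features=10, flip_y=0.1,
                           random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=1)

errors = {}
for k in (1, 15):
    knn = KNeighborsClassifier(n_neighbors=k).fit(X_tr, y_tr)
    # (train error, test error): k=1 gives ~0 train error but a noticeably
    # higher test error on this noisy data -- the gap signals high variance.
    errors[k] = (1 - knn.score(X_tr, y_tr), 1 - knn.score(X_te, y_te))
print(errors)
```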

- Wiki: Precision and Recall
- High *precision* means that an algorithm returned substantially more relevant results than irrelevant ones, while high *recall* means that an algorithm returned most of the relevant results.
- Often, there is an inverse relationship between precision and recall, where it is possible to increase one at the cost of reducing the other.
- Best way to remember the difference between sensitivity and specificity
- F1 score = harmonic_mean of precision and recall
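Computing all three metrics by hand on made-up predictions (illustrative data, not from the source):

```python
# Hypothetical predictions from a binary classifier.
actual    = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]
predicted = [1, 1, 1, 0, 0, 1, 0, 0, 0, 0]

tp = sum(a == 1 and p == 1 for a, p in zip(actual, predicted))
fp = sum(a == 0 and p == 1 for a, p in zip(actual, predicted))
fn = sum(a == 1 and p == 0 for a, p in zip(actual, predicted))

precision = tp / (tp + fp)  # of the returned results, how many are relevant
recall = tp / (tp + fn)     # of the relevant results, how many were returned
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of the two
print(precision, recall, f1)
```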

LSH (Locality Sensitive Hashing) for Cosine similarity.

How to use it for K-NN to get the k nearest neighbors in a d-dimensional vector space.

Faster than space partitioning with a kd-tree data structure.

If 'w' is a unit vector perpendicular to hyperplane 'pi_1', then for a point 'x' in the d-dimensional space, the sign of (w_transpose * x) tells us on which side of hyperplane 'pi_1' the point lies. This is simply the dot product between vectors 'w' and 'x'.

Generate 'm' random hyperplanes and, for each point, find on which side of each hyperplane the point lies. Based on that, create an m-tuple for each point and use it as the key for hashing.

It is a randomized algorithm, not a deterministic one, and there is a chance that a neighboring point lying on the other side of a hyperplane might be missed. That chance can be reduced by maintaining 'L' different hash tables, each with its own set of 'm' randomly generated hyperplanes, and combining the results as the union of the result sets from each hash table.
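The scheme above can be sketched in a few lines of NumPy (the values of d, m, L and the database are illustrative assumptions):

```python
import numpy as np

rng = np.random.RandomState(0)
d, m, L = 20, 8, 5  # dimensionality, hyperplanes per table, number of tables

# Each table gets its own m random hyperplanes (rows are normal vectors w).
tables = [rng.randn(m, d) for _ in range(L)]

def signature(planes, x):
    # m-tuple of signs of w^T x: which side of each hyperplane x falls on.
    return tuple((planes @ x > 0).astype(int))

# Index a small database of points in all L hash tables.
points = rng.randn(100, d)
buckets = [{} for _ in range(L)]
for i, x in enumerate(points):
    for t in range(L):
        buckets[t].setdefault(signature(tables[t], x), []).append(i)

def candidates(q):
    # Union of the query's buckets across tables; rank these by true cosine
    # similarity afterwards to get the k nearest neighbors.
    cand = set()
    for t in range(L):
        cand.update(buckets[t].get(signature(tables[t], q), []))
    return cand

q = points[0] + 0.01 * rng.randn(d)  # a tiny perturbation of point 0
print(0 in candidates(q))
```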

- Good read: A few useful things to know about Machine Learning by Pedro Domingos
- In practice, features are correlated and do not exhibit much variation. For these reasons, dimensionality reduction helps compress the data without losing much signal and combats the curse of dimensionality.
- Learning = Representation + Evaluation + Optimization
- The fundamental goal of machine learning is to generalize beyond the examples in the training set. That needs more than just data; domain knowledge comes in very handy.
- We don't have access to the function we need to optimize; we have to use training error as a surrogate for test error.
- Like deduction, induction (what learners do) is a knowledge lever: it turns a small amount of input knowledge into a large amount of output knowledge. Induction is vastly more powerful than deduction, requiring much less input knowledge to produce useful results.
- 'No Free Lunch' - no learner can beat random guessing over all possible functions to be learned.
- One way to understand overfitting is by decomposing generalization error into *bias* and *variance*.
- A common misconception is that overfitting is caused by noise; but severe overfitting can occur even in the absence of noise.

How the K-NN classifier behaves for various types of data.

u-shaped data, concentric circles (1 and 2), overlapped data, xor data, two spirals data, linear separable data, outlier data, random data.

Sample data.

plot_decision_regions api from mlxtend.

To install: `conda install -c conda-forge mlxtend`

Sample code (Jupyter notebook).

- Wiki: Feature Selection
- Feature selection is different from dimensionality reduction. Both seek to reduce the number of attributes in the dataset, but a dimensionality reduction method does so by creating new combinations of attributes, whereas feature selection methods include and exclude attributes present in the data without changing them.
- Introduction to Feature Selection
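A sketch of filter-style feature selection with `sklearn.feature_selection.SelectKBest` (the synthetic dataset and choice of ANOVA F-score are illustrative assumptions):

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif

# 5 informative features buried among 15 noise features.
X, y = make_classification(n_samples=300, n_features=20, n_informative=5,
                           n_redundant=0, shuffle=False, random_state=0)

# Keep the 5 features whose ANOVA F-score against the class is highest;
# the surviving columns are original attributes, not new combinations.
selector = SelectKBest(score_func=f_classif, k=5)
X_new = selector.fit_transform(X, y)

print(X_new.shape)  # (300, 5)
print(selector.get_support(indices=True))
```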

- Challenges related to imbalanced datasets.
- Undersampling and oversampling.
- Random oversampling (replicating minority-class data points) is prone to overfitting since we are copying information.
- Assigning class weights to samples inversely proportional to the number of points of that class.
- SMOTE - Synthetic Minority Oversampling Technique
- SMOTE explained.
- Quora: What is an imbalanced dataset?
- Wiki: Oversampling and Undersampling
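The class-weight idea above can be sketched with the 'balanced' heuristic (the same one sklearn's `class_weight='balanced'` uses); the label counts here are made up:

```python
import numpy as np

# Hypothetical imbalanced labels: 90 negatives, 10 positives.
y = np.array([0] * 90 + [1] * 10)

# 'Balanced' weights: inversely proportional to class frequency,
# weight_c = n_samples / (n_classes * count_c).
classes, counts = np.unique(y, return_counts=True)
weights = len(y) / (len(classes) * counts)

# The rare class gets a much larger weight, so each of its points
# contributes more to the loss.
print(dict(zip(classes, weights)))  # {0: 0.555..., 1: 5.0}
```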

- Multiclass classification
- One vs Rest (OVR) strategy
- Wiki: One vs Rest
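A sketch of the OVR strategy with `sklearn.multiclass.OneVsRestClassifier` (the 3-class synthetic dataset and the logistic-regression base learner are assumptions):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier

# 3-class toy problem; OVR fits one binary "this class vs. the rest"
# classifier per class and predicts the class with the highest score.
X, y = make_classification(n_samples=300, n_features=6, n_informative=4,
                           n_classes=3, random_state=0)
ovr = OneVsRestClassifier(LogisticRegression(max_iter=1000)).fit(X, y)

print(len(ovr.estimators_))  # 3 binary classifiers, one per class
print(ovr.score(X, y))
```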

- Curse of Dimensionality
- Wiki: Curse of Dimensionality
- Hughes phenomenon

- Generalizing correctly becomes exponentially harder as the dimensionality of the examples grows.
- Our intuitions, which come from a 3-d world, often do not apply in high-dimensional ones.
- In high dimensions, most of the mass of a multivariate Gaussian distribution is not near the mean, but in an increasingly distant "shell" around it; and most of the volume of a high-dimensional orange is in the skin, not the pulp.
- The main role of theoretical guarantees in machine learning is not as a criterion for practical decisions, but as a source of understanding and driving force for algorithm design.
- If you have many independent features that each correlate well with the class, learning is easy. On the other hand, if the class is a very complex function of the features, you may not be able to learn it.
- Feature engineering is more difficult because it’s domain-specific, while learners can be largely general-purpose.
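The Gaussian "shell" bullet above can be checked numerically (the dimension and sample count are arbitrary choices): for a standard Gaussian in d dimensions, distances from the mean concentrate tightly around sqrt(d).

```python
import numpy as np

rng = np.random.RandomState(0)
d = 1000  # high dimension

# Distances from the mean for 10,000 standard-Gaussian samples in d dims.
X = rng.randn(10_000, d)
dist = np.linalg.norm(X, axis=1)

# Mean distance ~ sqrt(d) with a small spread: a thin, distant "shell",
# so almost no sample lies near the mean itself.
print(dist.mean(), np.sqrt(d))  # both ~31.6
print(dist.std())               # small relative to the mean
```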

Machine Learning - Part 5