• More data beats a cleverer algorithm.
• All learners essentially work by grouping nearby examples into the same class; the key difference is in the meaning of 'nearby'.
• Ensemble models; bagging and boosting.
• The generalization error of a boosted ensemble continues to improve by adding classifiers even after the training error has reached zero.
• Contrary to intuition, there is no necessary connection between the number of parameters of a model and its tendency to overfit.
• Just because a function can be represented does not mean it can be learned.
• Correlation does not imply causation. On the other hand, correlation is a sign of a potential causal connection, and we can use it as a guide for further investigation.

Doing simple cross-validation and k-fold cross-validation for k-NN on a toy dataset.

Sample python code (Jupyter notebook).

Sample data (3.concertriccir2.csv).
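A minimal sketch of both procedures (the original notebook and CSV are not included here, so sklearn's `make_circles` stands in for the concentric-circles data; all parameter values are illustrative):

```python
from sklearn.datasets import make_circles
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Stand-in for the concentric-circles CSV: two noisy concentric circles.
X, y = make_circles(n_samples=500, noise=0.1, factor=0.4, random_state=42)

# Simple cross-validation: a single hold-out train/test split.
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=42)
knn = KNeighborsClassifier(n_neighbors=5).fit(X_tr, y_tr)
print("hold-out accuracy:", knn.score(X_te, y_te))

# 10-fold cross-validation to compare a few values of k.
for k in [1, 3, 5, 11, 21]:
    scores = cross_val_score(KNeighborsClassifier(n_neighbors=k), X, y, cv=10)
    print(f"k={k:2d}  mean CV accuracy = {scores.mean():.3f}")
```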

• To check for high bias: if the train error is high, the model is not fitting the training data properly; this may be a case of high bias.
• To check for high variance: if the train error is low but the test (or cross-validation) error is high, the model fits the training points but cannot generalize to unseen points; this may be a case of high variance. Another check: if you change the training data slightly, does the model's error change a lot? If so, that too points to high variance.
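A sketch of this diagnostic with k-NN, where a small k overfits (zero train error, a large train/test gap) and a very large k underfits (train error itself rises); `make_moons` is an illustrative stand-in dataset:

```python
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = make_moons(n_samples=600, noise=0.3, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.5, random_state=0)

results = {}
for k in [1, 15, 101]:
    knn = KNeighborsClassifier(n_neighbors=k).fit(X_tr, y_tr)
    tr_err = 1 - knn.score(X_tr, y_tr)   # high -> high bias
    te_err = 1 - knn.score(X_te, y_te)   # low train but high test -> high variance
    results[k] = (tr_err, te_err)
    print(f"k={k:3d}  train error={tr_err:.2f}  test error={te_err:.2f}")
```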

• Wiki: Precision and Recall
• High precision means that an algorithm returned substantially more relevant results than irrelevant ones, while high recall means that an algorithm returned most of the relevant results.
• Often, there is an inverse relationship between precision and recall, where it is possible to increase one at the cost of reducing the other.
• Best way to remember the difference between sensitivity and specificity
• F1 score = harmonic mean of precision and recall.
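A small worked example with hypothetical labels (1 = relevant, 0 = irrelevant), using sklearn's metrics:

```python
from sklearn.metrics import precision_score, recall_score, f1_score

# Hypothetical ground truth and predictions: 3 TP, 1 FP, 1 FN.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]

p = precision_score(y_true, y_pred)   # TP / (TP + FP) = 3/4
r = recall_score(y_true, y_pred)      # TP / (TP + FN) = 3/4
f1 = f1_score(y_true, y_pred)         # 2*p*r / (p + r)
print(p, r, f1)
```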

LSH (Locality Sensitive Hashing) for cosine similarity.

How to use it with k-NN to get the k nearest neighbors in a d-dimensional vector space.

Faster than space partitioning with a k-d tree data structure.

If 'w' is a unit vector perpendicular to hyperplane 'pi_1', then for a point 'x' in the d-dimensional space, the sign of (w_transpose * x) tells us on which side of hyperplane 'pi_1' the point lies. This is simply the dot product between vectors 'w' and 'x'.

Generate 'm' random hyperplanes and, for each point, find on which side of each hyperplane the point lies. Based on that, create an m-tuple of signs for each point and use it as the key for hashing.

It is a randomized algorithm, not a deterministic one, so there is a chance that a neighboring point lying on the other side of a hyperplane is missed. That chance can be reduced by building 'L' different hash tables, each with its own set of 'm' randomly generated hyperplanes, and combining the results as the union of the result sets from each hash table.
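A minimal sketch of the scheme described above, on random data; the values of 'd', 'm', and 'L' are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
d, m, L = 50, 8, 5                      # dimensions, hyperplanes per table, tables

X = rng.normal(size=(1000, d))                        # database points
planes = [rng.normal(size=(m, d)) for _ in range(L)]  # L sets of m random hyperplanes

def signature(x, P):
    # m-tuple of signs of w^T x: which side of each hyperplane x falls on.
    return tuple((P @ x > 0).astype(int))

# Build L hash tables keyed by the m-tuple signature.
tables = []
for P in planes:
    t = {}
    for i, x in enumerate(X):
        t.setdefault(signature(x, P), []).append(i)
    tables.append(t)

def candidates(q):
    # Union of the buckets the query hashes to across the L tables;
    # these candidates are then ranked by exact cosine similarity for k-NN.
    out = set()
    for P, t in zip(planes, tables):
        out.update(t.get(signature(q, P), []))
    return out

q = X[0] + 0.01 * rng.normal(size=d)    # a slightly perturbed copy of point 0
cand = candidates(q)
print(len(cand), 0 in cand)
```

Each table alone can miss a true neighbor that falls just across one of its hyperplanes; the union over L tables makes that miss probability shrink multiplicatively.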

• by Pedro Domingos
• In practice, features are correlated and do not exhibit much variation. For these reasons, dimensionality reduction helps compress the data without losing much signal and combats the curse of dimensionality.
• Learning = Representation + Evaluation + Optimization
• The fundamental goal of machine learning is to generalize beyond the examples in the training set. That needs more than just data; domain knowledge comes in very handy.
• We don't have access to the function we need to optimize; we have to use training error as a surrogate for test error.
• Like deduction, induction (what learners do) is a knowledge lever: it turns a small amount of input knowledge into a large amount of output knowledge. Induction is vastly more powerful than deduction, requiring much less input knowledge to produce useful results.
• 'No Free Lunch' - no learner can beat random guessing over all possible functions to be learned.
• One way to understand overfitting is to decompose generalization error into bias and variance.
• A common misconception is that overfitting is caused by noise, but severe overfitting can occur even in the absence of noise.

How the K-NN classifier behaves for various types of data.

U-shaped data, concentric circles (1 and 2), overlapped data, XOR data, two-spirals data, linearly separable data, outlier data, random data.

Sample data.

plot_decision_regions API from mlxtend.
To install: `conda install -c conda-forge mlxtend`

Sample code (Jupyter notebook).
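A non-plotting sketch of two of these datasets (XOR generated by hand, concentric circles via sklearn's `make_circles` as a stand-in for the CSVs); the decision-surface pictures themselves come from the mlxtend call noted in the comment:

```python
import numpy as np
from sklearn.datasets import make_circles
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(1)

# XOR data: label 1 when the two coordinates have opposite signs.
X_xor = rng.uniform(-1, 1, size=(400, 2))
y_xor = (X_xor[:, 0] * X_xor[:, 1] < 0).astype(int)

# Concentric circles.
X_cir, y_cir = make_circles(n_samples=400, noise=0.05, factor=0.4, random_state=1)

scores = {}
for name, X, y in [("xor", X_xor, y_xor), ("circles", X_cir, y_cir)]:
    knn = KNeighborsClassifier(n_neighbors=5).fit(X, y)
    scores[name] = knn.score(X, y)
    print(name, "train accuracy:", scores[name])

# To visualise the decision surface:
#   from mlxtend.plotting import plot_decision_regions
#   plot_decision_regions(X, y.astype(int), clf=knn)
```

Because k-NN's decision boundary follows the local neighborhood structure, it handles the non-linear shapes (XOR, circles, spirals) that a linear model cannot.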

• Wiki: Feature Selection
• Feature selection is different from dimensionality reduction. Both methods seek to reduce the number of attributes in the dataset, but a dimensionality reduction method does so by creating new combinations of attributes, whereas feature selection methods include and exclude attributes present in the data without changing them.
• Introduction to Feature Selection

• Challenges related to imbalanced datasets.
• Undersampling and Oversampling
• Random oversampling (replicating minority class data points) is prone to overfitting since we are copying information.
• Assigning class weights inversely proportional to the number of points of that class, so the minority class gets a larger weight.
• SMOTE - Synthetic Minority Oversampling Technique
• SMOTE explained.
• Quora: What is an imbalanced dataset?
• Wiki: Oversampling and Undersampling
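A minimal SMOTE-style interpolation sketch (not the reference imbalanced-learn implementation): instead of copying minority points, each synthetic point is a random interpolation between a minority point and one of its k nearest minority neighbors.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def smote_sketch(X_min, n_new, k=5, seed=0):
    """Generate n_new synthetic minority points by interpolating between
    each sampled point and one of its k nearest minority neighbors."""
    rng = np.random.default_rng(seed)
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X_min)
    _, idx = nn.kneighbors(X_min)               # idx[:, 0] is the point itself
    new = []
    for _ in range(n_new):
        i = rng.integers(len(X_min))            # pick a minority point
        j = idx[i, rng.integers(1, k + 1)]      # one of its k true neighbors
        lam = rng.random()                      # interpolation factor in [0, 1)
        new.append(X_min[i] + lam * (X_min[j] - X_min[i]))
    return np.array(new)

X_min = np.random.default_rng(2).normal(size=(20, 3))  # toy minority class
X_syn = smote_sketch(X_min, n_new=30)
print(X_syn.shape)
```

For the class-weight route, many sklearn classifiers instead accept `class_weight="balanced"`, which weights classes inversely to their frequencies.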

• Multiclass classification
• One vs Rest (OVR) strategy
• Wiki: One vs Rest
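A sketch of the OvR strategy with sklearn's `OneVsRestClassifier` on the 3-class iris data; the base model choice is illustrative:

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier

X, y = load_iris(return_X_y=True)   # 3 classes

# OvR trains one binary classifier per class ("this class" vs "the rest")
# and predicts the class whose classifier gives the highest score.
ovr = OneVsRestClassifier(LogisticRegression(max_iter=1000)).fit(X, y)
print(len(ovr.estimators_))         # one fitted estimator per class
print(ovr.score(X, y))
```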

• Generalizing correctly becomes exponentially harder as the dimensionality of the examples grows.
• Our intuitions, which come from a 3-d world, often do not apply in high-dimensional ones.
• In high dimensions, most of the mass of a multivariate Gaussian distribution is not near the mean, but in an increasingly distant “shell” around it; and most of the volume of a high-dimensional orange is in the skin, not the pulp.
• The main role of theoretical guarantees in machine learning is not as a criterion for practical decisions, but as a source of understanding and driving force for algorithm design.
• If you have many independent features that each correlate well with the class, learning is easy. On the other hand, if the class is a very complex function of the features, you may not be able to learn it.
• Feature engineering is more difficult because it’s domain-specific, while learners can be largely general-purpose.

Machine Learning - Part 5