  • More data beats a cleverer algorithm.
  • All learners essentially work by grouping nearby examples into the same class; the key difference is in the meaning of 'nearby'.
  • Learn many models, not just one.
  • Ensemble models: bagging and boosting.
  • The generalization error of a boosted ensemble continues to improve by adding classifiers even after the training error has reached zero.
  • Contrary to intuition, there is no necessary connection between the number of parameters of a model and its tendency to overfit.
  • Just because a function can be represented does not mean it can be learned.
  • Correlation does not imply causation. On the other hand, correlation is a sign of a potential causal connection, and we can use it as a guide for further investigation.




Doing simple cross-validation and k-fold cross-validation for k-NN on a toy dataset (a sketch follows below).

Sample Python code (Jupyter notebook).

Sample data (3.concertriccir2.csv).
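A minimal sketch of both approaches, assuming scikit-learn and using make_circles as a stand-in for the 3.concertriccir2.csv data (the actual notebook and CSV are not reproduced here):

```python
# Sketch: simple train/CV/test split vs 10-fold CV for choosing k in k-NN.
import numpy as np
from sklearn.datasets import make_circles
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = make_circles(n_samples=1000, noise=0.1, factor=0.4, random_state=0)

# Simple cross-validation: hold out a separate validation set.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
X_tr, X_cv, y_tr, y_cv = train_test_split(X_train, y_train, test_size=0.3, random_state=0)

best_k, best_acc = None, 0.0
for k in range(1, 30, 2):
    knn = KNeighborsClassifier(n_neighbors=k).fit(X_tr, y_tr)
    acc = knn.score(X_cv, y_cv)
    if acc > best_acc:
        best_k, best_acc = k, acc

# 10-fold cross-validation: every training point is used for both fitting and validation.
cv_scores = {k: cross_val_score(KNeighborsClassifier(n_neighbors=k),
                                X_train, y_train, cv=10).mean()
             for k in range(1, 30, 2)}
best_k_cv = max(cv_scores, key=cv_scores.get)

final = KNeighborsClassifier(n_neighbors=best_k_cv).fit(X_train, y_train)
print("simple CV best k:", best_k, "| 10-fold CV best k:", best_k_cv,
      "| test accuracy:", final.score(X_test, y_test))
```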





  • Bias-Variance tradeoff
  • Wiki: Bias-Variance tradeoff
  • To check for high bias: if the train error is high, the model is not fitting the training data properly; this may be a case of high bias.
  • To check for high variance: if the train error is low but the test (or cross-validation) error is high, the model fits the training points well but cannot generalize to unseen points; this may be a case of high variance. Another check: if slightly changing the training data changes the model's error a lot, that also points to high variance.
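As a rough illustration (not from the original notes), the same diagnosis can be read off a table of train vs cross-validation errors across model complexity; here the complexity knob is k in k-NN on a synthetic dataset:

```python
# Sketch: diagnose high bias vs high variance from train and CV error.
# Small k (complex model): low train error but higher CV error -> high variance.
# Very large k (simple model): both errors high -> high bias.
from sklearn.datasets import make_moons
from sklearn.model_selection import validation_curve
from sklearn.neighbors import KNeighborsClassifier

X, y = make_moons(n_samples=500, noise=0.3, random_state=0)
k_values = [1, 5, 15, 51, 101, 201]

train_scores, cv_scores = validation_curve(
    KNeighborsClassifier(), X, y,
    param_name="n_neighbors", param_range=k_values, cv=5)

for k, tr, cv in zip(k_values, train_scores.mean(axis=1), cv_scores.mean(axis=1)):
    print(f"k={k:4d}  train error={1 - tr:.3f}  cv error={1 - cv:.3f}")
```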


  • Wiki: Precision and Recall
  • High precision means that an algorithm returned substantially more relevant results than irrelevant ones, while high recall means that an algorithm returned most of the relevant results.
  • Often, there is an inverse relationship between precision and recall, where it is possible to increase one at the cost of reducing the other.
  • Best way to remember the difference between sensitivity and specificity
  • F1 score = harmonic mean of precision and recall.
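A tiny sketch of these metrics on made-up labels, assuming scikit-learn:

```python
# Sketch: precision, recall and F1 (harmonic mean of the two) on toy labels.
from sklearn.metrics import precision_score, recall_score, f1_score, confusion_matrix

y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]   # hypothetical ground truth
y_pred = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]   # hypothetical predictions

p = precision_score(y_true, y_pred)   # TP / (TP + FP) = 3/4
r = recall_score(y_true, y_pred)      # TP / (TP + FN) = 3/4
f1 = f1_score(y_true, y_pred)         # 2*p*r / (p + r)

print(confusion_matrix(y_true, y_pred))
print(f"precision={p:.2f} recall={r:.2f} f1={f1:.2f}")
```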


LSH (Locality Sensitive Hashing) for cosine similarity.

How to use it in k-NN to get the k nearest neighbors of a point in a d-dimensional vector space.

It is faster than space partitioning with a k-d tree data structure.

If 'w' is a unit vector perpendicular to hyperplane 'pi_1', then for a point 'x' in the d-dimensional space, the sign of (w_transpose * x) tells us on which side of hyperplane 'pi_1' the point lies. This is simply the dot product between vectors 'w' and 'x'.
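A quick numeric check of this sign test; 'w' and the points below are arbitrary examples:

```python
# Sketch: sign of w.x tells us which side of the hyperplane (with unit normal w) x lies on.
import numpy as np

w = np.array([1.0, 1.0]) / np.sqrt(2)   # unit normal of the hyperplane x1 + x2 = 0
x_above = np.array([2.0, 3.0])
x_below = np.array([-1.0, -4.0])

print(np.sign(np.dot(w, x_above)))   #  1.0 -> same side as w
print(np.sign(np.dot(w, x_below)))   # -1.0 -> opposite side
```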

Generate 'm' random hyperplanes and, for each point, find on which side of each hyperplane the point lies. Based on that, create an m-tuple of signs for each point and use it as the key for hashing.

It is a randomized algorithm and not a deterministic one. And there are chances that a neighboring point that lies on other side of hte hyperplane might be missed. The chances of that happening can be reduced by having 'L' different hash-tables with its own set of randomly generated 'm' hyperplanes. And combine the result as the Union of set of results from each hash-table.
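A minimal sketch of this scheme, assuming NumPy; the function names and parameters (m hyperplanes per table, L tables) are my own, and the exact-cosine re-ranking of the candidate set is one reasonable final step:

```python
# Sketch: random-hyperplane LSH for cosine similarity with L hash tables.
import numpy as np
from collections import defaultdict

rng = np.random.default_rng(0)

def build_tables(data, m=8, L=4):
    """Return L (hyperplanes, buckets) pairs; key = tuple of signs of w.x."""
    tables = []
    d = data.shape[1]
    for _ in range(L):
        planes = rng.normal(size=(m, d))          # m random hyperplane normals
        buckets = defaultdict(list)
        keys = (data @ planes.T) >= 0             # n x m matrix of side indicators
        for i, key in enumerate(keys):
            buckets[tuple(key)].append(i)
        tables.append((planes, buckets))
    return tables

def query(tables, data, q, k=5):
    """Union of candidates from all L tables, then exact cosine ranking."""
    candidates = set()
    for planes, buckets in tables:
        key = tuple((planes @ q) >= 0)
        candidates.update(buckets.get(key, []))
    if not candidates:
        return []
    cand = np.array(sorted(candidates))
    sims = (data[cand] @ q) / (np.linalg.norm(data[cand], axis=1) * np.linalg.norm(q))
    return cand[np.argsort(-sims)][:k]

data = rng.normal(size=(10000, 50))               # toy d-dimensional points
q = rng.normal(size=50)
print(query(build_tables(data), data, q))
```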

  • Good read: A Few Useful Things to Know about Machine Learning
  • by Pedro Domingos
  • In practice, features are correlated and do not exhibit much variation. For these reasons, dimensionality reduction helps compress the data without losing much signal and combats the curse of dimensionality (a small PCA sketch follows this list).
  • Learning = Representation + Evaluation + Optimization
  • The fundamental goal of machine learning is to generalize beyond the examples in the training set. That requires more than just data; domain knowledge comes in very handy.
  • We don't have access to the function we need to optimize; we have to use training error as a surrogate for test error.
  • Like deduction, induction (what learners do) is a knowledge lever: it turns a small amount of input knowledge into a large amount of output knowledge. Induction is vastly more powerful than deduction, requiring much less input knowledge to produce useful results.
  • 'No Free Lunch' - no learner can beat random guessing over all possible functions to be learned.
  • One way to understand overfitting is to decompose generalization error into bias and variance.
  • A common misconception is that overfitting is caused by noise; severe overfitting can occur even in the absence of noise.
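A small PCA sketch of the dimensionality-reduction point above, on synthetic correlated features (the numbers are purely illustrative):

```python
# Sketch: PCA compresses correlated features into a few components carrying most of the variance.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
base = rng.normal(size=(500, 3))
# Build 10 features that are noisy linear combinations of 3 underlying signals.
X = base @ rng.normal(size=(3, 10)) + 0.05 * rng.normal(size=(500, 10))

pca = PCA(n_components=3)
X_reduced = pca.fit_transform(X)              # 500 x 3 compressed representation
print(pca.explained_variance_ratio_.round(3)) # first 3 components carry nearly all the variance
```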


How the K-NN classifier behaves for various types of data.

U-shaped data, concentric circles (1 and 2), overlapped data, XOR data, two-spirals data, linearly separable data, outlier data, random data.

Sample data.

plot_decision_regions API from mlxtend.
To install: conda install -c conda-forge mlxtend

Sample code (Jupyter notebook).
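A minimal sketch of such a plot, assuming mlxtend and scikit-learn are installed; the XOR-style data is generated here rather than loaded from the sample files:

```python
# Sketch: k-NN decision regions on XOR-style data using mlxtend's plot_decision_regions.
import numpy as np
import matplotlib.pyplot as plt
from mlxtend.plotting import plot_decision_regions
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(400, 2))
y = ((X[:, 0] * X[:, 1]) > 0).astype(int)     # XOR-like labelling of the quadrants

clf = KNeighborsClassifier(n_neighbors=15).fit(X, y)
plot_decision_regions(X, y, clf=clf)
plt.title("k-NN (k=15) decision regions")
plt.show()
```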

  • Wiki: Feature Selection
  • Feature selection is different from dimensionality reduction. Both seek to reduce the number of attributes in the dataset, but dimensionality reduction methods do so by creating new combinations of attributes, whereas feature selection methods include and exclude attributes present in the data without changing them (see the sketch after this list).
  • Introduction to Feature Selection
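A small sketch of feature selection (keeping a subset of the original columns), using scikit-learn's SelectKBest as one possible selector on a built-in dataset:

```python
# Sketch: feature selection keeps a subset of the original columns, unchanged.
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import SelectKBest, mutual_info_classif

X, y = load_breast_cancer(return_X_y=True)
selector = SelectKBest(score_func=mutual_info_classif, k=5).fit(X, y)

X_selected = selector.transform(X)                    # same columns, just fewer of them
print("kept feature indices:", selector.get_support(indices=True))
```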


  • Challenges related to imbalanced datasets.
  • Undersampling and Oversampling
  • Random oversampling (replicating minority-class data points) is prone to overfitting, since we are only copying existing information.
  • Assigning class weights inversely proportional to the number of points in each class, so that errors on the minority class are penalized more (see the sketch after this list).
  • SMOTE - Synthetic Minority Oversampling Technique
  • SMOTE explained.
  • Quora: What is an imbalanced dataset?
  • Wiki: Oversampling and Undersampling
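A sketch of two of these options, class weights (scikit-learn) and SMOTE (the imbalanced-learn package), on a synthetic imbalanced dataset:

```python
# Sketch: two ways to handle class imbalance - class weights and SMOTE oversampling.
from collections import Counter
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from imblearn.over_sampling import SMOTE   # pip install imbalanced-learn

X, y = make_classification(n_samples=2000, weights=[0.95, 0.05], random_state=0)
print("original class counts:", Counter(y))

# Option 1: weight errors on the minority class more heavily.
clf = LogisticRegression(class_weight="balanced", max_iter=1000).fit(X, y)

# Option 2: synthesize new minority-class points instead of plain replication.
X_res, y_res = SMOTE(random_state=0).fit_resample(X, y)
print("after SMOTE:", Counter(y_res))
```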


  • Multiclass classification
  • One vs Rest (OVR) strategy
  • Wiki: One vs Rest
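A minimal sketch of the OVR strategy using scikit-learn's OneVsRestClassifier wrapper on the 3-class iris dataset:

```python
# Sketch: one-vs-rest trains one binary classifier per class and picks the most confident one.
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier

X, y = load_iris(return_X_y=True)          # 3 classes -> 3 binary classifiers
ovr = OneVsRestClassifier(LogisticRegression(max_iter=1000)).fit(X, y)

print("number of binary estimators:", len(ovr.estimators_))
print("predictions:", ovr.predict(X[:5]))
```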




  • Generalizing correctly becomes exponentially harder as the dimensionality of the examples grows.
  • Our intuitions, which come from the 3-D world, often do not apply in high-dimensional spaces.
  • In high dimensions, most of the mass of a multivariate Gaussian distribution is not near the mean but in an increasingly distant “shell” around it, and most of the volume of a high-dimensional orange is in the skin, not the pulp. Intuitively, the squared distance from the mean is a sum of d independent terms, so typical points lie at a distance of roughly sqrt(d) standard deviations from the mean (a small numerical check follows this list).
  • The main role of theoretical guarantees in machine learning is not as a criterion for practical decisions, but as a source of understanding and driving force for algorithm design.
  • If you have many independent features that each correlate well with the class, learning is easy. On the other hand, if the class is a very complex function of the features, you may not be able to learn it.
  • Feature engineering is more difficult because it’s domain-specific, while learners can be largely general-purpose.
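A small numerical check of the “shell” claim above: distances of standard-Gaussian samples from the mean concentrate around sqrt(d) as the dimension d grows:

```python
# Sketch: in high dimensions, Gaussian samples concentrate in a thin shell at radius ~sqrt(d).
import numpy as np

rng = np.random.default_rng(0)
for d in [2, 10, 100, 1000]:
    samples = rng.normal(size=(10000, d))          # standard multivariate Gaussian
    dist = np.linalg.norm(samples, axis=1)         # distance of each sample from the mean
    print(f"d={d:5d}  mean distance={dist.mean():7.2f}  "
          f"std={dist.std():5.2f}  sqrt(d)={np.sqrt(d):7.2f}")
```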


Machine Learning - Part 5