Eigenvalues and eigenvectors (PCA) - dimensionality reduction. (~23 mins)
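A minimal sketch of the idea, assuming NumPy and synthetic 2-D data (the data and variable names are illustrative, not from the lecture). The eigenvectors of the covariance matrix give the principal directions, and each eigenvalue is the variance of the data along its eigenvector.

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic, correlated 2-D data (illustrative only).
x = rng.normal(size=500)
data = np.column_stack([x, 0.5 * x + 0.1 * rng.normal(size=500)])

# Centre the data, then form the 2x2 covariance matrix.
centered = data - data.mean(axis=0)
cov = np.cov(centered, rowvar=False)

# eigh is meant for symmetric matrices; eigenvalues come back in ascending order.
eigenvalues, eigenvectors = np.linalg.eigh(cov)
top_direction = eigenvectors[:, -1]  # direction of maximum variance

# Projecting onto that direction reduces the data from 2-D to 1-D.
projected = centered @ top_direction
print("eigenvalues:", eigenvalues)
print("variance along top direction:", projected.var(ddof=1))
```

The variance of the projected points should match the largest eigenvalue, which is why PCA keeps the eigenvectors with the largest eigenvalues.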

Python code for t-SNE on MNIST: code (784 dimensions to 2 dimensions). The code takes more than 30 minutes to run on a two-core i5 for 15,000 data points, with the perplexity parameter set to 50 and the number of iterations set to 5000. With all 42,000 data points it will take longer than that; I haven't tried it.

Documentation for some of the functions used in the code:

Histograms, PDFs (EDA) (~16 mins)
Univariate analysis using PDF (~6 mins)
Which of the four features in iris dataset is most useful to do the classification?
The sepal-length and sepal-width have massive overlap between the three categories of flowers.
Python code for Iris histogram generation.
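As a rough illustration of why overlap matters, the sketch below estimates a PDF per class from histograms and measures how much the two PDFs overlap. It assumes NumPy and uses synthetic stand-in values for one feature, not real Iris measurements; the overlap score is an illustrative choice, not something from the lecture.

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic stand-ins for one feature measured on two classes (not real Iris values).
class_a = rng.normal(loc=1.5, scale=0.2, size=50)
class_b = rng.normal(loc=4.5, scale=0.5, size=50)

# Shared bin edges so the two histograms are comparable.
bins = np.linspace(0.0, 7.0, 29)
pdf_a, _ = np.histogram(class_a, bins=bins, density=True)
pdf_b, _ = np.histogram(class_b, bins=bins, density=True)

# Overlap of the two estimated PDFs: 0 means fully separated, 1 means identical.
bin_width = bins[1] - bins[0]
overlap = np.sum(np.minimum(pdf_a, pdf_b)) * bin_width
print(f"overlap between the two estimated PDFs: {overlap:.3f}")
```

A feature whose per-class PDFs barely overlap is a good candidate for classification with a simple threshold.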

PCA specifically for dimensionality reduction (~14 mins)
Converting 784-dimensional MNIST data points to 200 dimensions.
The basic idea of PCA -- maximizing the variance of projected points.
200 dimensions out of the 784 in the MNIST dataset preserve 90% of the information/variance. 350 dimensions, which is less than half of the original set of dimensions, preserve 95% of the information in the data. The remaining dimensions (784 - 350 = 434) only add the last 5% of the information, which is quite astonishing.

Python code for the PCA cumulative variance: code
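The cumulative-variance calculation can be sketched as follows, assuming NumPy. The function name `components_needed` and the synthetic 20-feature data are hypothetical stand-ins; on MNIST the same calculation would be run on the 784 pixel features.

```python
import numpy as np

def components_needed(data, target):
    """Smallest number of principal components whose cumulative
    explained-variance ratio reaches `target` (e.g. 0.90)."""
    centered = data - data.mean(axis=0)
    cov = np.cov(centered, rowvar=False)
    # Eigenvalues of the covariance matrix are the per-component variances.
    eigenvalues = np.linalg.eigvalsh(cov)[::-1]      # descending order
    eigenvalues = np.clip(eigenvalues, 0.0, None)    # guard against tiny negatives
    cumulative = np.cumsum(eigenvalues) / np.sum(eigenvalues)
    # First index where the cumulative ratio reaches the target, as a count.
    count = int(np.searchsorted(cumulative, target)) + 1
    return min(count, len(cumulative)), cumulative

rng = np.random.default_rng(0)
scales = 1.0 / np.arange(1, 21)                      # feature i has std 1/i
data = rng.normal(size=(1000, 20)) * scales          # synthetic, not MNIST
k90, cumulative = components_needed(data, 0.90)
k95, _ = components_needed(data, 0.95)
print("components for 90%:", k90, "components for 95%:", k95)
```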

Alternative formulation of PCA using distance minimization. (~10 mins)
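A short derivation of why the two formulations agree, under the assumption that the points x_i are mean-centered and u is a unit vector (the symbols are chosen here for illustration). By Pythagoras, the squared distance d_i^2 from x_i to the line along u is the squared length of x_i minus its squared projection onto u:

```latex
\begin{aligned}
d_i^2 &= \lVert x_i \rVert^2 - (u^\top x_i)^2, \qquad \lVert u \rVert = 1,\\
\min_{u} \sum_{i=1}^{n} d_i^2
  &= \sum_{i=1}^{n} \lVert x_i \rVert^2 \;-\; \max_{u} \sum_{i=1}^{n} (u^\top x_i)^2 .
\end{aligned}
```

The first sum does not depend on u, so minimizing the total squared distance is the same as maximizing the variance of the projected points.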

t-SNE (t-distributed Stochastic Neighbor Embedding) (~6 mins)
t-SNE Wiki
PCA tries to preserve the global shape/structure of the data and doesn't care about its local structure; it only cares about the directions that maximize the variance. If a small cluster of points lies outside the main set of points, that outlying cluster is still projected onto the main direction of maximal variance, thereby losing information. t-SNE, on the other hand, can preserve the local structure. You can also make t-SNE preserve more of the global structure by changing its parameters.

Python code for PCA on MNIST: code (784 dimensions to 2 dimensions)

Documentation for some of the functions used in the code:

Information Entropy (~13 mins, ~12 mins)

If something is more predictable, it has less entropy than something that is less predictable. A fair coin has more entropy than an unfair coin because it is easier to predict the outcome of an unfair coin than that of a fair coin. Another way to think about it: entropy is the information you would obtain by learning the value of some unknown random variable or quantity.

Entropy is information we don't have. Get information and you reduce the entropy value.
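The coin comparison can be checked numerically with Shannon's formula, H = -sum(p * log2(p)). This is a small sketch using only the standard library; the function name `entropy_bits` and the 0.9/0.1 bias are illustrative choices.

```python
import math

def entropy_bits(probabilities):
    """Shannon entropy in bits; outcomes with probability 0 contribute nothing."""
    return sum(-p * math.log2(p) for p in probabilities if p > 0)

fair = entropy_bits([0.5, 0.5])      # 1.0 bit
biased = entropy_bits([0.9, 0.1])    # about 0.469 bits
certain = entropy_bits([1.0, 0.0])   # 0 bits: nothing left to learn

print(fair, biased, certain)
```

The more lopsided the coin, the lower the entropy, down to zero when the outcome is certain.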

Crowding problem in t-SNE (~8 mins)
Sometimes it is impossible to preserve the distances between the points in all the neighborhoods using t-SNE. Example by contradiction: embedding a square from a 2-D plane onto a 1-D line.
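A small sketch of the square example, assuming the square is represented by its four corners and using only the standard library. Each corner has two neighbors at distance 1, and those two neighbors are sqrt(2) apart. On a line, two points at distance 1 from a third are either 0 or 2 apart, so no placement can keep all the distances. The "unrolled" layout below is just one candidate embedding.

```python
import math
from itertools import combinations

# Corners of a unit square, listed in order around the boundary.
square = {"A": (0, 0), "B": (1, 0), "C": (1, 1), "D": (0, 1)}
# One candidate 1-D embedding: "unroll" the boundary onto a line.
line = {"A": 0.0, "B": 1.0, "C": 2.0, "D": 3.0}

broken = []
for p, q in combinations(square, 2):
    d2 = math.dist(square[p], square[q])   # distance in the 2-D plane
    d1 = abs(line[p] - line[q])            # distance on the 1-D line
    print(f"{p}-{q}: 2-D distance {d2:.3f}, 1-D distance {d1:.3f}")
    if not math.isclose(d2, d1):
        broken.append((p, q))

print("pairs whose distance changed:", broken)
```

Corners A and D are neighbors in the square but end up three units apart on the line.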

Machine Learning - Part 2 

Geometric intuition of t-SNE (~8 mins)
Basic idea: preserve the distances between points within a neighborhood. For points that are not in the neighborhood, t-SNE makes no guarantees about their distances; they may be placed anywhere.

Visualizing MNIST dataset using PCA (~5 mins)
Converting 784 dimensions to 2 dimensions by projecting the data onto the top two eigenvectors. PCA is a weaker dimensionality reduction technique than t-SNE for this kind of visualization.
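Projection onto the top two eigenvectors can be sketched as below, assuming NumPy. The helper `project_to_top_two` and the synthetic 10-dimensional data are hypothetical stand-ins for the MNIST code linked in these notes.

```python
import numpy as np

def project_to_top_two(data):
    """Project the rows of `data` onto the two eigenvectors of the
    covariance matrix that have the largest eigenvalues."""
    centered = data - data.mean(axis=0)
    cov = np.cov(centered, rowvar=False)
    eigenvalues, eigenvectors = np.linalg.eigh(cov)   # ascending eigenvalues
    top_two = eigenvectors[:, [-1, -2]]               # largest eigenvalue first
    return centered @ top_two

rng = np.random.default_rng(0)
data = rng.normal(size=(300, 10))   # synthetic stand-in, not MNIST
data[:, 0] *= 5.0                   # give two features most of the variance
data[:, 1] *= 3.0
embedded = project_to_top_two(data)
print(embedded.shape)               # (300, 2)
```

The two output columns are the coordinates you would scatter-plot, colored by class label.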

Neighborhood of a point, Embedding (t-SNE) (~7 mins)
Embedding -- for every point in the high dimensional space, find an equivalent point in the low dimensional space.

Pair plots in Exploratory Data Analysis (EDA) (~13 mins)

  • Easy to generate pair plots using seaborn.
  • The diagonal elements on the matrix of plots are the PDFs for each feature.
  • Petal length and petal width are pretty good at separating Setosa flowers from non-Setosa flowers. See the plot on the third row, fourth column.
  • Python code for pair plots.

Limitations of pair plots (~2 mins)
  • Pair plots are easy to understand only when the number of features (dimensions) is small.

How to apply t-SNE and interpret its output. (~38 mins)
Link to distill.pub for t-sne: link

Parameters of t-SNE: number of iterations, perplexity and epsilon
Perplexity parameter: roughly the number of points in each neighborhood whose distances I want to preserve when I do dimensionality reduction using t-SNE.
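To make the "number of neighbors" reading concrete: in the t-SNE formulation, each point gets a Gaussian distribution over the other points, and the perplexity is 2 raised to the entropy (in bits) of that distribution. The sketch below assumes NumPy; the function name and the toy distances are illustrative, and a real implementation searches for the bandwidth sigma that hits the requested perplexity.

```python
import numpy as np

def perplexity_of_point(distances_sq, sigma):
    """Perplexity 2**H of the Gaussian neighbor distribution of one point.

    distances_sq: squared distances from the point to every *other* point.
    sigma: Gaussian bandwidth for this point (t-SNE tunes it per point).
    """
    logits = -distances_sq / (2.0 * sigma ** 2)
    logits -= logits.max()                      # numerical stability
    p = np.exp(logits)
    p /= p.sum()
    nonzero = p[p > 0]
    entropy_bits = -np.sum(nonzero * np.log2(nonzero))
    return 2.0 ** entropy_bits

# Five neighbors at distance 1 and ninety-five points far away at distance 10.
distances_sq = np.array([1.0] * 5 + [100.0] * 95)
narrow = perplexity_of_point(distances_sq, sigma=1.0)    # close to 5
wide = perplexity_of_point(distances_sq, sigma=100.0)    # close to 100
print(narrow, wide)
```

With a narrow bandwidth only the five close points count as neighbors; with a very wide one, nearly all 100 points do.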

Always try to run t-SNE with multiple perplexity values. If your perplexity almost equals the number of data points, you end up getting a mess. So, always try to keep the perplexity much less than the count of input data points.

Keep iterating until you reach a stable configuration.
t-SNE is not deterministic; you might end up getting slightly different results for each run.
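A hedged sketch of trying several perplexity values, assuming scikit-learn's `TSNE` class and synthetic blob data rather than MNIST. The perplexity values are arbitrary examples, the iteration count is left at the library default, and `random_state` is fixed only so that repeated runs give the same picture.

```python
import numpy as np
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)
# Synthetic stand-in data: three Gaussian blobs in 20 dimensions (not MNIST).
centers = rng.normal(scale=10.0, size=(3, 20))
data = np.vstack([c + rng.normal(size=(30, 20)) for c in centers])

embeddings = {}
for perplexity in (5, 15, 30):   # all well below the 90 data points
    tsne = TSNE(n_components=2, perplexity=perplexity, random_state=0)
    embeddings[perplexity] = tsne.fit_transform(data)

for perplexity, points in embeddings.items():
    print(perplexity, points.shape)
```

Plotting each embedding side by side shows which features of the picture are stable across perplexity values and which are artifacts of one setting.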

t-SNE tends to expand dense groups of points and contract sparse ones, so cluster sizes in a t-SNE plot don't mean anything. Also, t-SNE does not preserve distances between clusters.
Make sure t-SNE is not making apparent sense out of random junk points.

Introduction to Iris data set and 2d scatter plot. (~7 mins)

  • EDA - exploratory data analysis and why it's important.
  • Type of flowers: Setosa, Versicolor and Virginica.
  • 4 features: sepal-length, sepal-width, petal-length, petal-width.
  • Visualizing the features using scatter plots (pandas.DataFrame.plot and using seaborn).
  • Sepal length and sepal width features can be used to distinguish Setosa flowers from the other kinds of flowers.
  • 3d scatter plot for the dataset.
  • pandas.DataFrame.plot doc
  • Plotly for iris dataset
  • 'One of the clusters contains Iris setosa, while the other cluster contains both Iris virginica and Iris versicolor and is not separable without the species information that Fisher used'-- wiki
  • Python code for the 2d scatter plots.

PCA for dimensionality reduction (~10 mins)
Using PCA to reduce 2-dimensional data to 1 dimension.
Using PCA to reduce 10-dimensional data to 2 dimensions.
Doing PCA while still preserving 99% of the variance.

t-SNE on MNIST (~7 mins)
We cannot interpret cluster sizes or inter-cluster distances in t-SNE.
Points which are visually similar are grouped/clustered together in t-SNE.