- A continuous distribution whose density follows a power law is called a Pareto distribution.
- The Pareto distribution has two parameters: x_m (scale) and alpha (shape). x_m is the smallest value the variable can take, and it is where the density peaks at the left edge of the distribution.
- As alpha decreases, the distribution has a fatter tail, and vice versa. As alpha tends to infinity, the distribution approaches a single spike at x_m (a Dirac delta function).
- To check whether some data follows a Pareto distribution, draw a log-log plot and check whether it gives a straight line (the q-q plot technique works as well).
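
A minimal sketch of the log-log check, using only the standard library. The sample size, seed, alpha = 2.0 and x_m = 1.5 are arbitrary assumptions. For a Pareto distribution the complementary CDF is (x_m / x)^alpha, so log(CCDF) against log(x) should be close to a straight line with slope -alpha.

```python
import math
import random

def loglog_ccdf_slope(samples, keep=0.99):
    """Least-squares slope of log(empirical CCDF) against log(x)."""
    xs = sorted(samples)
    n = len(xs)
    m = int(n * keep)  # drop the few largest points, where the CCDF is noisiest
    log_x = [math.log(x) for x in xs[:m]]
    # Empirical CCDF at the i-th smallest value: fraction of samples >= it.
    log_ccdf = [math.log((n - i) / n) for i in range(m)]
    mean_x = sum(log_x) / m
    mean_y = sum(log_ccdf) / m
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(log_x, log_ccdf))
    sxx = sum((a - mean_x) ** 2 for a in log_x)
    return sxy / sxx

random.seed(0)
alpha, x_m = 2.0, 1.5
# random.paretovariate draws from a Pareto with x_m = 1; scaling moves x_m.
pareto_samples = [x_m * random.paretovariate(alpha) for _ in range(10_000)]
slope = loglog_ccdf_slope(pareto_samples)
print(f"fitted slope {slope:.2f}, expected about {-alpha}")
```

A slope far from a constant (a visibly curved log-log plot) would suggest the data is not Pareto distributed.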

- Links:
- Pareto Principle
- Power law

Machine Learning - Part 4

- Bag of Words model creation using Scikit-learn
- CountVectorizer
- TfidfVectorizer
- When are unigrams more stable than bigrams or n-grams? (stackoverflow link)
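
To make the tf-idf weighting concrete, here is a plain-Python sketch using the textbook formula tf * log(N / df). It is not scikit-learn's implementation: TfidfVectorizer by default uses a smoothed idf and L2-normalizes each row, so its numbers will differ. The three documents are made up.

```python
import math
from collections import Counter

def tf_idf(docs):
    """Return one {term: weight} dict per document using tf * log(N / df)."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    weights = []
    for tokens in tokenized:
        tf = Counter(tokens)
        weights.append({
            term: (count / len(tokens)) * math.log(n_docs / df[term])
            for term, count in tf.items()
        })
    return weights

docs = ["the cat sat on the mat", "the dog sat on the log", "the cat chased the dog"]
weights = tf_idf(docs)
for term, weight in sorted(weights[0].items(), key=lambda kv: -kv[1]):
    print(f"{term:>4} {weight:.3f}")
```

A word that occurs in every document ("the") gets weight 0, while a word unique to one document ("mat") gets the highest weight in that document.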

- Links:
- Scikit-learn countVectorizer
- Text feature extraction using bag of words
- tf-idf-vectorizer from scikit-learn library

- Idea behind word2vec: 'the meaning of a word can be inferred by the company it keeps'.
- If two words have very similar neighbors (that is, the contexts in which they are used are similar), then the words are probably similar in meaning, or at least related.
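
The "company it keeps" idea can be illustrated without training a model. The sketch below is not word2vec; it only counts neighboring words within a window and compares the resulting count vectors with cosine similarity. The tiny corpus and the window size of 2 are assumptions chosen for illustration.

```python
import math
from collections import Counter, defaultdict

def context_vectors(sentences, window=2):
    """For every word, count the words seen within `window` positions of it."""
    contexts = defaultdict(Counter)
    for sentence in sentences:
        tokens = sentence.lower().split()
        for i, word in enumerate(tokens):
            lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    contexts[word][tokens[j]] += 1
    return contexts

def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(u[k] * v[k] for k in u if k in v)
    norm = math.sqrt(sum(x * x for x in u.values())) * math.sqrt(sum(x * x for x in v.values()))
    return dot / norm if norm else 0.0

corpus = [
    "i drink hot coffee every morning",
    "i drink hot tea every morning",
    "the train leaves the station at noon",
]
vectors = context_vectors(corpus)
sim_coffee_tea = cosine(vectors["coffee"], vectors["tea"])
sim_coffee_train = cosine(vectors["coffee"], vectors["train"])
print(sim_coffee_tea, sim_coffee_train)
```

"coffee" and "tea" share all their neighbors in this corpus, so their similarity is high, while "coffee" and "train" share none.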

- Links:
- A word2vec tutorial using gensim (from Kavita Ganesan)
- Tutorial - develop word embeddings in python with gensim
- Gensim word2vec documentation

1. Calculating the P-Value in Statistics:

- P-value: the probability of obtaining a result at least as extreme as the one observed in your data, assuming H0 is true.
- The p-value is a criterion for deciding whether to reject the null hypothesis or fail to reject it: H0 is rejected when the p-value falls below the chosen significance level.
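
As a worked example, the sketch below computes an exact one-sided p-value for a coin-flip experiment. The numbers (60 heads in 100 flips) and the 0.05 significance level are made up for illustration.

```python
from math import comb

def binomial_p_value(heads, flips, p_null=0.5):
    """One-sided p-value: P(X >= heads) when X ~ Binomial(flips, p_null)."""
    return sum(
        comb(flips, k) * p_null**k * (1 - p_null) ** (flips - k)
        for k in range(heads, flips + 1)
    )

# H0: the coin is fair. Observed: 60 heads in 100 flips (made-up numbers).
p_value = binomial_p_value(60, 100)
significance = 0.05
reject_h0 = p_value < significance
print(f"p-value = {p_value:.4f}, reject H0: {reject_h0}")
```

Here "at least as extreme" means 60 or more heads, so the p-value sums the probabilities of all those outcomes under H0.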

Eigenvalues and eigenvectors (~17 mins)

- Some special vectors remain on their own span even after a linear transformation; these are called eigenvectors.
- Each eigenvector has an associated eigenvalue, which is the factor by which the eigenvector is stretched or squished during the transformation.
- Consider a 3D rotation: if you can find an eigenvector for that rotation, you have found its axis of rotation. Its eigenvalue must be 1, since a rotation does not stretch or squish anything.
- Meaning of the equation (A v = lambda v): there is a vector 'v' which, when transformed by the matrix 'A', gives the same result as multiplying 'v' by a scalar (which just squishes or stretches the vector). Solve this equation for lambda and v to find the eigenvalue (lambda) and eigenvector (v).
- Not every transformation has (real) eigenvectors. For example, a 90 degree rotation of the 2D plane has none, because it turns every vector off its span.
- For a transformation that scales everything by 2, the only eigenvalue is 2, and every nonzero vector in the plane is an eigenvector with that eigenvalue.
- Basis vectors that are also eigenvectors form an eigenbasis.
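
The three cases above (a matrix with two eigenvalues, a 90 degree rotation, and uniform scaling by 2) can be checked with a small sketch for 2x2 matrices. It solves det(A - lambda I) = 0 directly; the example matrix [[3, 1], [0, 2]] is an arbitrary choice.

```python
import math

def real_eigenvalues_2x2(a, b, c, d):
    """Real eigenvalues of [[a, b], [c, d]] from det(A - lambda*I) = 0."""
    trace, det = a + d, a * d - b * c
    disc = trace * trace - 4 * det
    if disc < 0:
        return []  # only complex eigenvalues: no real eigenvectors
    root = math.sqrt(disc)
    return sorted({(trace - root) / 2, (trace + root) / 2})

def transform(matrix, v):
    """Apply a 2x2 matrix (given as two rows) to the vector v."""
    (a, b), (c, d) = matrix
    return (a * v[0] + b * v[1], c * v[0] + d * v[1])

print(real_eigenvalues_2x2(3, 1, 0, 2))    # two distinct eigenvalues
print(transform(((3, 1), (0, 2)), (-1, 1)))  # (-1, 1) is only scaled, by 2
print(real_eigenvalues_2x2(0, -1, 1, 0))   # 90 degree rotation: none
print(real_eigenvalues_2x2(2, 0, 0, 2))    # scaling by 2: a single eigenvalue
```

For larger matrices a numerical routine such as `numpy.linalg.eig` would be the usual tool; the closed form here only works for the 2x2 case.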

Bag of Words model to represent a document as a point in a d-dimensional vector space.
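
A plain-Python sketch of that mapping, where d is the vocabulary size and each document becomes a vector of word counts. The three example documents are made up, and the whitespace tokenizer is a simplification of what a library vectorizer does.

```python
from collections import Counter

def bag_of_words(docs):
    """Return (vocabulary, vectors) with one count vector per document."""
    tokenized = [doc.lower().split() for doc in docs]
    vocabulary = sorted({term for tokens in tokenized for term in tokens})
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        vectors.append([counts[term] for term in vocabulary])
    return vocabulary, vectors

docs = ["the pasta is tasty", "the pasta is not tasty", "the pizza is good"]
vocabulary, vectors = bag_of_words(docs)
print(vocabulary)
for vector in vectors:
    print(vector)
```

The first two documents differ in a single coordinate ("not"), which shows both the appeal and the weakness of the model: nearby points can have opposite meanings.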

Bootstrap based confidence interval of medians (for heights of a population):

- sklearn.utils.resample
- Kaggle European Soccer database to get sample heights
- With a small sample of 10 heights, the 95% confidence interval was 162 to 176. The width is 14, which is large because the sample is small. The resample size was set equal to the original sample size, since 10 is already a small number.
- The same experiment was conducted with data from Kaggle on European soccer players (around 11k heights). With a resample size of 30, the 95% confidence interval was 177.80 to 185.42. The width is 7.62, so the estimate is more precise than in the previous experiment.
- When the resample size was increased to 100 (from the 11k heights), the 95% confidence interval was 180.34 to 182.88. The width is only about 2.5.
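
The experiment can be sketched with the standard library alone. This version uses `random.choices` for sampling with replacement in place of `sklearn.utils.resample`, and synthetic normally distributed heights (mean 181 cm, standard deviation 6.5 cm, both assumed) in place of the Kaggle data, so the exact numbers will not match the ones above.

```python
import random
import statistics

def bootstrap_median_ci(data, resample_size, n_boot=2000, level=0.95, rng=random):
    """Percentile bootstrap confidence interval for the median."""
    medians = sorted(
        statistics.median(rng.choices(data, k=resample_size))
        for _ in range(n_boot)
    )
    lo_idx = int(n_boot * (1 - level) / 2)
    hi_idx = int(n_boot * (1 + level) / 2) - 1
    return medians[lo_idx], medians[hi_idx]

rng = random.Random(42)
# Synthetic heights in cm (assumed values, not the Kaggle data).
heights = [rng.gauss(181, 6.5) for _ in range(11_000)]
ci_30 = bootstrap_median_ci(heights, resample_size=30, rng=rng)
ci_100 = bootstrap_median_ci(heights, resample_size=100, rng=rng)
print(f"resample size 30:  {ci_30[0]:.2f} to {ci_30[1]:.2f}")
print(f"resample size 100: {ci_100[0]:.2f} to {ci_100[1]:.2f}")
```

As in the notes, the interval narrows as the resample size grows, because the median of a larger resample varies less from one resample to the next.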

Using the Box-Cox transform to convert a log-gamma distribution (a type of power law distribution) to an approximately normal distribution. The result was checked with a QQ plot.

- PP plot vs QQ plot
- Log-Gamma distribution (from SciPy)
- Box-Cox (from SciPy) - link1
- Box-Cox - link2
- Box-Cox - link3
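
A hedged sketch of what the Box-Cox transform does, written without SciPy. It applies y = (x^lambda - 1) / lambda (or log x when lambda is 0) and picks lambda from a grid by maximizing the normal log-likelihood, which is the same criterion `scipy.stats.boxcox` optimizes when no lambda is supplied. Log-normal data is used here as an assumed stand-in for right-skewed positive data, and the grid range of -2 to 2 is arbitrary.

```python
import math
import random

def box_cox(xs, lam):
    """Box-Cox transform of positive data; lam = 0 is the log transform."""
    if lam == 0:
        return [math.log(x) for x in xs]
    return [(x**lam - 1) / lam for x in xs]

def variance(xs):
    mu = sum(xs) / len(xs)
    return sum((x - mu) ** 2 for x in xs) / len(xs)

def log_likelihood(xs, lam):
    """Profile log-likelihood of lam under a normal model for the transformed data."""
    ys = box_cox(xs, lam)
    return -0.5 * len(xs) * math.log(variance(ys)) + (lam - 1) * sum(math.log(x) for x in xs)

def skewness(xs):
    mu, sd = sum(xs) / len(xs), math.sqrt(variance(xs))
    return sum(((x - mu) / sd) ** 3 for x in xs) / len(xs)

random.seed(1)
# Right-skewed positive data (assumed example, not the data from the notes).
data = [random.lognormvariate(0.0, 0.6) for _ in range(2000)]
grid = [i / 20 for i in range(-40, 41)]  # lambda from -2.0 to 2.0 in steps of 0.05
best_lam = max(grid, key=lambda lam: log_likelihood(data, lam))
transformed = box_cox(data, best_lam)
print(f"best lambda {best_lam:.2f}")
print(f"skewness before {skewness(data):.2f}, after {skewness(transformed):.2f}")
```

Skewness close to 0 after the transform is only a rough indicator; a QQ plot of the transformed data against the normal distribution is the more informative check.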

- Intro to Hypothesis Testing in Statistics
- Figuring out the null and alternative hypotheses from the problem statement

- Video: Investigate whether bootstrap confidence intervals work using computer simulation.
- Stack Exchange answer on why the bootstrap works.

- A z-score tells you how many standard deviations a value lies above or below the mean.
- Z-scores can be negative, which means the value is below the mean.
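
A short sketch of the calculation, z = (x - mean) / standard deviation. The three height values are made up, and the population standard deviation is used (an assumption; the sample standard deviation is an equally common choice).

```python
import statistics

def z_scores(values):
    """Number of standard deviations each value lies from the mean."""
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)  # population standard deviation
    return [(v - mean) / sd for v in values]

heights_cm = [160, 170, 180]  # made-up values
scores = z_scores(heights_cm)
print([round(z, 3) for z in scores])
```

The value below the mean gets a negative z-score, the value equal to the mean gets 0, and the value above the mean gets a positive z-score.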

- The Kolmogorov–Smirnov (KS) test is a non-parametric test that compares the overall shape of distributions (not specifically central tendency, dispersion or other parameters).
- The null hypothesis H0 for the two-sample version of the KS test: the two samples are drawn from populations that follow the same underlying distribution.
- Test statistic D: the maximum absolute difference between the two empirical CDFs.
- If H0 is true and the samples come from the same distribution, D will be 0 or at least very small.
- Kolmogorov–Smirnov test
- Infimum and Supremum
- Empirical Distribution function
- KS Test (nist.gov)
- KS test only applies to continuous distributions.
- How to interpret p-values in KS test.
- How to interpret kstest and ks-2samp.
- QQ-plot or KS-test
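
The test statistic D from the bullets above can be computed directly from the two empirical CDFs. This sketch only computes D, not the p-value (`scipy.stats.ks_2samp` returns both). The sample sizes, seed and the shift of 1 are assumptions for illustration.

```python
import bisect
import random

def ks_statistic(sample_a, sample_b):
    """Two-sample KS statistic: max |F_a(x) - F_b(x)| over all observed x."""
    a, b = sorted(sample_a), sorted(sample_b)
    d = 0.0
    for x in a + b:
        # Empirical CDF = fraction of the sample that is <= x.
        cdf_a = bisect.bisect_right(a, x) / len(a)
        cdf_b = bisect.bisect_right(b, x) / len(b)
        d = max(d, abs(cdf_a - cdf_b))
    return d

rng = random.Random(7)
same_1 = [rng.gauss(0, 1) for _ in range(500)]
same_2 = [rng.gauss(0, 1) for _ in range(500)]
shifted = [rng.gauss(1, 1) for _ in range(500)]
d_same = ks_statistic(same_1, same_2)
d_shifted = ks_statistic(same_1, shifted)
print(f"same distribution: D = {d_same:.3f}")
print(f"shifted by 1:      D = {d_shifted:.3f}")
```

Because both empirical CDFs are step functions that only change at observed values, checking the difference at every observed value is enough to find the maximum.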

- Snowball stemmer - also called Porter2 stemmer
- Snowball is a small string processing language designed for creating stemming algorithms for use in Information Retrieval.

Note: great video.

Change of basis (~12 mins)

- How to translate between coordinate systems?
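
One way to make the translation concrete for the 2D case: put the new basis vectors, written in standard coordinates, as the columns of a matrix B. Multiplying by B translates coordinates in the new basis to standard coordinates, and multiplying by the inverse of B goes the other way. The basis b1 = (2, 1), b2 = (-1, 1) is an example choice.

```python
def to_standard(basis, coords):
    """Coordinates in `basis` (b1, b2) -> standard coordinates."""
    (b1x, b1y), (b2x, b2y) = basis
    return (coords[0] * b1x + coords[1] * b2x,
            coords[0] * b1y + coords[1] * b2y)

def from_standard(basis, v):
    """Standard coordinates -> coordinates in `basis`, via the 2x2 inverse."""
    (b1x, b1y), (b2x, b2y) = basis
    det = b1x * b2y - b2x * b1y
    if det == 0:
        raise ValueError("basis vectors are linearly dependent")
    return ((b2y * v[0] - b2x * v[1]) / det,
            (-b1y * v[0] + b1x * v[1]) / det)

basis = ((2, 1), (-1, 1))              # b1 = (2, 1), b2 = (-1, 1): example values
v_std = to_standard(basis, (-1, 2))    # the vector -1*b1 + 2*b2
v_back = from_standard(basis, v_std)   # translate back to the new basis
print(v_std, v_back)
```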

- Video - Introduction to Permutation testing
- Resampling statistics: Randomization and the Bootstrap
- stackexchange question related to resampling
- Resampling statistics - link 2
- Jackknife resampling (wiki)

Probability Distributions (part 1):

- Poisson, classic example: chance of Prussian cavalryman being killed by horse kick
- The data scientist's crib sheet (Sean Owen)
- Guinness beer and Student's t, William Gosset
- Weibull distribution, log-normal distribution
- Gamma function and Gamma distribution
- Normal distribution
- Exponential distribution
- Central limit theorem
- 3-parameter Weibull distribution
- Identifying distribution of data using minitab

- Vector space model: algebraic model for representing text documents
- Each word (or sentence) can be represented as a single point in an n-dimensional vector space.
- If a word is represented as a one-hot (bag-of-words style) vector, the number of dimensions of the vector space is the count of unique words in the corpus (set of documents).
- Two simple ways to convert a sentence to a vector: avg-word2vec and tf-idf-word2vec.
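
Both sentence-vector schemes are weighted averages of word vectors, which the sketch below shows with one function. The 2-dimensional embeddings and the tf-idf weights are made-up toy values; in practice the vectors would come from a trained word2vec model and the weights from a tf-idf computation over the corpus.

```python
def average_vector(tokens, embeddings, weights=None):
    """Weighted mean of the word vectors of `tokens`; unknown words are skipped.

    With weights=None every word counts equally (avg-word2vec); passing
    per-word tf-idf weights gives the tf-idf weighted variant.
    """
    known = [t for t in tokens if t in embeddings]
    if not known:
        raise ValueError("no token has an embedding")
    total = [0.0] * len(embeddings[known[0]])
    weight_sum = 0.0
    for token in known:
        w = 1.0 if weights is None else weights.get(token, 0.0)
        weight_sum += w
        for i, x in enumerate(embeddings[token]):
            total[i] += w * x
    if weight_sum == 0:
        raise ValueError("all weights are zero")
    return [x / weight_sum for x in total]

# Toy 2-d embeddings and tf-idf weights (hypothetical values).
embeddings = {"good": [1.0, 0.0], "movie": [0.0, 1.0], "the": [0.5, 0.5]}
tfidf_weights = {"the": 0.0, "good": 2.0, "movie": 1.0}
tokens = ["the", "good", "movie"]
avg_vec = average_vector(tokens, embeddings)
tfidf_vec = average_vector(tokens, embeddings, tfidf_weights)
print(avg_vec, tfidf_vec)
```

The tf-idf weighted version pulls the sentence vector toward the rarer, more informative words and away from words like "the".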

- Links:
- Vector space model for information retrieval from text
- tf-idf model
- Gerard Salton
- Ranking web-pages using tf-idf strategy

Abstract vector spaces (~17 mins)