K-Means Clustering

October 10, 2018 | May 21, 2023 | Archit Vora

This post is a lecture summary of a course by the University of Washington available on Coursera: Machine Learning: Clustering and Retrieval.

Application

We want to discover groups of articles (e.g., sports, world news) without knowing the labels in advance. We can assign labels after finding clusters. Our goal is to learn each user's preferences, for example whether she likes sports or technology.

Users provide feedback on whether they liked an article or not. We can accumulate this feedback within specific clusters to understand preferences.

Contrast this with building a classification model that predicts whether a user will like a specific article. For that, we would need to build a classifier for each user and score each (article, user) pair.

Alternatively, we can determine the types of articles each user prefers and show nearby articles using K-nearest neighbors (KNN). Clustering offers a simpler and more elegant approach.

Unsupervised Task

Clustering is an unsupervised task since there are no input-output labels. Unsupervised learning poses challenges:

Defining clusters can be easy or hard, depending on the case.
Our algorithm works well for cases with intermediate difficulty.
There are still unsolved clustering problems.

Clustering in General

A cluster is defined by its center or centroid and its shape, often represented by an ellipsoid.

K-Means

Scoring in K-means involves assigning a point to a given cluster. The scoring mechanism is the distance to the cluster center.

K-means alternates between two steps:

Assign data to clusters.
Update cluster centers.

Both steps are minimization steps, making K-means resemble a coordinate descent algorithm. We alternate between minimizing over two sets of variables: z (cluster assignments) and μ (cluster centers).
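The two alternating steps can be sketched in a few lines. This is a minimal illustration, not the course's implementation: the function names `kmeans` and `squared_distance`, the toy points, and the hand-picked initial centers are all assumptions made for the example.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two points."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))

def kmeans(points, centers, max_iter=100):
    """Alternate between the assignment step and the center-update step."""
    centers = [list(c) for c in centers]
    assignments = None
    for _ in range(max_iter):
        # Step 1: assign each point to its closest center.
        new_assignments = [
            min(range(len(centers)), key=lambda j: squared_distance(p, centers[j]))
            for p in points
        ]
        if new_assignments == assignments:
            break  # assignments stopped changing, so the algorithm has converged
        assignments = new_assignments
        # Step 2: move each center to the mean of the points assigned to it.
        for j in range(len(centers)):
            members = [p for p, z in zip(points, assignments) if z == j]
            if members:  # keep the old center if a cluster ends up empty
                centers[j] = [sum(col) / len(members) for col in zip(*members)]
    return centers, assignments

# Toy data (assumed): two well-separated groups, one initial center in each.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5), (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]
centers, assignments = kmeans(points, centers=[points[0], points[3]])
```

Each step can only lower (or keep) the sum of squared distances, which is why the loop stops once the assignments no longer change.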

Convergence

K-means is a non-convex optimization problem, so finding the global minimum is not guaranteed. We aim to find the zᵢ and μⱼ that minimize the objective function, the sum of squared distances from each point to its cluster center.

By optimizing via coordinate descent, K-means converges to a local minimum. However, which local minimum it reaches depends on the initialization. To improve the result, we use smart initialization as described below.

Smart Initialization

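Pick the first center randomly from the data points.
Calculate the distance to the nearest chosen center for all data points.
Choose the next center from the data points with a probability proportional to the squared distance.
Repeat the last two steps until K centers have been chosen.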

To perform probability sampling, calculate the cumulative probability as described in [1]. When K-means is run with this initialization, it is known as K-means++.
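The initialization steps, including the cumulative-sum sampling trick, might look like the sketch below. The names `kmeans_pp_init` and `squared_distance` and the toy data are assumptions for illustration; the sampling uses unnormalized cumulative sums, which is equivalent to normalizing them into cumulative probabilities.

```python
import bisect
import random

def squared_distance(a, b):
    """Squared Euclidean distance between two points."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))

def kmeans_pp_init(points, k, rng):
    """Pick k initial centers, favoring points far from the centers chosen so far."""
    centers = [rng.choice(points)]  # first center: uniformly at random
    while len(centers) < k:
        # Squared distance from every point to its nearest chosen center.
        d2 = [min(squared_distance(p, c) for c in centers) for p in points]
        # Running (cumulative) sums of the squared distances.
        cumulative = []
        total = 0.0
        for d in d2:
            total += d
            cumulative.append(total)
        # Draw a number in [0, total) and take the first point whose
        # cumulative sum exceeds it; points with larger d2 cover more of the range.
        r = rng.random() * total
        idx = min(bisect.bisect_right(cumulative, r), len(points) - 1)
        centers.append(points[idx])
    return centers

# Toy data (assumed) with a fixed seed so the run is repeatable.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5), (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]
centers = kmeans_pp_init(points, k=2, rng=random.Random(0))
```

A point that is already a center has squared distance zero, so it occupies no part of the cumulative range and is not drawn again.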

Trade-offs

K-means++ takes more time for initialization, but the subsequent K-means iterations tend to converge faster. Additionally, the converged solution tends to be better than that of K-means with purely random initialization.

Assessing Quality

Quality can be assessed by evaluating the objective value after convergence, which is referred to as cluster heterogeneity. A smaller heterogeneity indicates a better clustering. Tighter clusters are more homogeneous.

Beware of overfitting: as K increases, the lowest achievable heterogeneity always decreases, reaching zero when every point is its own cluster.

Choosing K

To choose K, you can heuristically plot heterogeneity for various values of K and select the one at the elbow. Alternatively, you can use hierarchical clustering, which doesn’t require specifying K in advance. We can also perform Silhouette Analysis.

References

[1]: Stack Overflow: How exactly does k-means++ work?

Nearest Neighbour Search

October 9, 2018 | May 21, 2023 | Archit Vora

This post is a lecture summary of a course by the University of Washington available on Coursera: Machine Learning: Clustering and Retrieval.

Problem: Document Retrieval

Given a document, we want to search for similar documents in a corpus.

Representation: How to Represent Documents as Vectors?

Bag of Words: Count the frequency of each word in the vocabulary.
TF-IDF: Calculate the IDF (inverse document frequency) of each word in the vocabulary. Weight word frequency (TF) by its IDF in a given document.
- TF: Same as bag of words.
- IDF: Higher for rare words, calculated as log(D_total / (1 + D_word)).
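As a rough illustration of the weighting described above, the sketch below computes TF-IDF with the same IDF formula, log(D_total / (1 + D_word)). The function name `tf_idf` and the tiny corpus are made up for the example.

```python
import math
from collections import Counter

def tf_idf(documents):
    """documents: list of token lists. Returns one {word: weight} dict per document."""
    n_docs = len(documents)
    # Document frequency: in how many documents does each word appear?
    doc_freq = Counter()
    for doc in documents:
        doc_freq.update(set(doc))
    idf = {word: math.log(n_docs / (1 + df)) for word, df in doc_freq.items()}
    # Term frequency (bag-of-words count) weighted by IDF.
    return [
        {word: tf * idf[word] for word, tf in Counter(doc).items()}
        for doc in documents
    ]

# Tiny corpus, assumed for illustration only.
docs = [
    "the game was a great game".split(),
    "the election was close".split(),
    "the market fell".split(),
    "the team won".split(),
]
weights = tf_idf(docs)
```

In this corpus "the" appears in every document, so its weight is pushed down (slightly below zero with this formula), while "game", which is frequent in one document and absent elsewhere, gets the largest weight in the first document.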

Distance Metrics

There are different distance metrics used to measure similarity or dissimilarity between vectors. Let’s explore some commonly used distance metrics:

Euclidean Distance:

Euclidean distance between two vectors x and y can be calculated using the vectorized form: sqrt((x-y).T.(x-y)).
It is beneficial to normalize features because features can have different units and varying ranges.
Normalization can be achieved by dividing either by the range or by the variance of the features.

Scaled Euclidean Distance:

Scaled Euclidean distance allows for weighting different dimensions differently based on their importance.
For example, in text analysis, we may want to give more weight to the title/abstract than the body of an article.
This can be achieved by summing up the word vectors of the article and title and using a diagonal matrix A that contains weights for each feature.
The vectorized form for scaled Euclidean distance is: sqrt((x-y).T.A.(x-y)).

Cosine Similarity:

Cosine similarity measures the cosine of the angle between two vectors.
For unit-length vectors, similarity = x.dot(y).
Distance = 1 - similarity.
Cosine distance is not a proper “distance” metric because it does not satisfy the triangle inequality (|ab| + |bc| ≥ |ac|).
Cosine similarity is commonly used in Natural Language Processing (NLP) tasks due to its computational efficiency for sparse features.
Feature vectors in NLP tend to be sparse.
The range of similarity values is generally between -1 and 1. However, when features are non-negative, the range becomes 0 to 1, which is the case with TF-IDF.
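For vectors that are not unit length, normalize the inner product: Similarity = x.dot(y) / (|x||y|).
Normalization makes the similarity independent of document length: doubling a document by appending a copy of itself leaves its similarity to other documents unchanged.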
One side effect of normalization is that long documents and tweets may appear similar, which is undesired when suggesting tweets to someone reading a document.
To address this, an alternative approach is to cap the maximum word count that can appear in a vector.
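A small sketch of normalized cosine similarity, showing that doubling a document's word counts does not change the similarity. The function name and the count vectors are assumptions chosen for the example.

```python
import math

def cosine_similarity(x, y):
    """Inner product of x and y divided by the product of their lengths."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    return dot / (norm_x * norm_y)

# Word-count vectors over a four-word vocabulary (made-up numbers).
doc = [3, 0, 1, 2]
doubled = [2 * v for v in doc]   # the same document copied twice
other = [0, 4, 1, 0]

sim_doubled = cosine_similarity(doc, doubled)  # same direction, different length
sim_other = cosine_similarity(doc, other)      # mostly different words
distance_other = 1 - sim_other
```

Because all counts are non-negative, the similarities here stay between 0 and 1, as noted for TF-IDF features above.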

Cosine vs Euclidean:

The choice between cosine and Euclidean distance depends on the use case.
Cosine distance is more suitable for text features where the magnitude of the vector is not as important as the orientation.
Euclidean distance is a better choice for count features, such as measuring how many times a document was read.
For mixed feature types, both cosine and Euclidean distances can be computed (possibly with a subset of relevant features) and combined using appropriate weights, considering the trade-offs and requirements of the specific problem.

Brute Force

Calculate the distance to each document in the corpus
Suppose there are N documents in the corpus
Complexity is O(N) for 1-NN
It is O(N log k) for k-NN
- Using a priority queue
This is computationally inefficient when N is large and we need to query often
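The priority-queue version of brute-force k-NN can be sketched with Python's `heapq`. The helper names and the toy corpus are illustrative assumptions; the heap is kept at size k, which is where the O(N log k) bound comes from.

```python
import heapq
import math

def euclidean(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def brute_force_knn(query, corpus, k):
    """Return the k closest (distance, index) pairs, nearest first."""
    heap = []  # max-heap via negated distances; never holds more than k items
    for idx, doc in enumerate(corpus):
        d = euclidean(query, doc)
        if len(heap) < k:
            heapq.heappush(heap, (-d, idx))
        elif d < -heap[0][0]:
            # Closer than the current k-th nearest: replace the farthest one.
            heapq.heapreplace(heap, (-d, idx))
    return sorted((-neg_d, idx) for neg_d, idx in heap)

# Toy corpus of 2-D "document vectors" (assumed).
corpus = [(0, 0), (1, 1), (5, 5), (2, 2), (9, 9)]
neighbors = brute_force_knn((1.2, 1.2), corpus, k=2)
nearest_indices = [idx for _, idx in neighbors]
```

Every query still touches all N documents, which is the cost that KD-trees and LSH try to avoid.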

KD-Trees

KD-Trees are used to divide the space into hierarchical regions via a tree structure. Here’s how it works:

At each node, we store the smallest bounding box that contains all the data points in that region.
While querying (for 1 nearest neighbor, which can be extended for k nearest neighbors):
- First, we find the leaf node where the query point belongs.
- Then, we find the nearest neighbor within that node.
- We keep moving up in the tree.
- If the distance between the query point and the bounding box is greater than the distance found so far, we skip that region and move up the tree.
- If it is less, we traverse down the tree.
To calculate the distance between a point and a rectangle, you can use Euclidean distance or another suitable measure. This is computationally easy when the rectangles are axis-aligned [1].
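For axis-aligned bounding boxes, the point-to-box distance used in the pruning test can be computed one dimension at a time. The sketch below assumes Euclidean distance and a box given by its minimum and maximum corners; the function name and the sample numbers are illustrative.

```python
import math

def distance_to_box(point, box_min, box_max):
    """Euclidean distance from a point to an axis-aligned box (0 if the point is inside)."""
    total = 0.0
    for p, lo, hi in zip(point, box_min, box_max):
        # How far the coordinate lies outside [lo, hi]; zero if it is within the range.
        gap = max(lo - p, 0.0, p - hi)
        total += gap * gap
    return math.sqrt(total)

# Pruning test during a query: skip a region whose box is farther away
# than the best distance found so far (numbers are made up).
best_so_far = 1.5
box_min, box_max = (0.0, 0.0), (2.0, 2.0)
can_prune = distance_to_box((5.0, 5.0), box_min, box_max) > best_so_far
```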

Construction of KD-Trees:

Here are the steps involved in constructing KD-Trees:

Choose a splitting strategy:
- Widest dimension: Split based on the dimension with the widest range.
- Alternating dimension: Split alternately between dimensions.
Determine the value to split on (split value):
- Median: Choose the median value of the points along the splitting dimension.
- Center point: Compute the center point as the average of the left and right extremes.
Stop conditions for splitting:
- If there are fewer points left than a specified threshold.
- If the minimum width of the region is achieved.

How to Prune More Aggressively:

To prune more aggressively in KD-Trees, consider the following approach:

Suppose the nearest distance found is r.
Instead of pruning points that are farther than distance r, prune all points that are farther than (r/a), where a > 1.
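This allows more aggressive pruning and can speed up the search, at the cost of returning an approximate nearest neighbor whose distance is at most a times the true nearest distance.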

Limitations of KD-Trees:

KD-Trees face challenges in high-dimensional spaces. Here are some limitations:

In high dimensions, the radius of the search space is generally large.
For d dimensions, the radius intersects 2^d hypercubes, resulting in an extensive search space.
One possible solution is to prune all points that are farther away than (r/a), where a > 1, to mitigate these issues.

LSH – Locality Sensitive Hashing

Locality Sensitive Hashing (LSH) is a technique used to classify data points into various bins and search within a given bin. Here’s how it works:

The idea is to group data points into bins based on their proximity and search within the relevant bin(s).
If necessary, nearby bins can also be explored to expand the search.
To determine nearby bins, multiple random lines (or planes in higher dimensions) are used.
Let’s consider a motivating example with points in a 2D plane:
- A single random line passing through the origin is chosen.
- Points that are closer to each other will mostly fall into the same bin.
However, using just one line can put too many data points in a single bin, increasing query time.
The solution is to use multiple random lines (or planes), which split the space into more, smaller bins.
For example, let’s consider three lines, f1, f2, and f3, each of the form f(x,y) = ax + by = 0 with its own coefficients (e.g., 3x + 4y = 0):
- If f1(p,q) < 0, f2(p,q) > 0, and f3(p,q) > 0 for a given point (p,q), we assign it to the bucket [0,1,1].
To facilitate searching within bins, a dictionary can be used, where the key represents the bin (e.g., [0,1,1], [0,1,0]), and the value is a list of all points in that bin.
To find nearby bins:
- Flip one bit to find bins with a single bit difference.
- For more exploration, flip two bits to find bins with two bits difference.
It is possible for the nearest point to be in a different bin.
The number of neighboring bins to explore depends on factors such as computational budget or a predefined quality threshold.
In some cases, an exact nearest neighbor may not be necessary, and a point with a predefined quality (epsilon) is sufficient.
LSH is useful in scenarios like document suggestion, where similar documents need to be retrieved efficiently.
LSH can also be extended to high-dimensional spaces by using planes or hyperplanes instead of lines.
For assigning bins to data points, dot product calculations are performed, which work well even in large dimensions, especially when the feature vectors are sparse.
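The binning and neighbor-bin search described above could be sketched as follows. All function names are assumptions, and the three planes are hand-picked stand-ins for random ones so the example is repeatable; `make_hyperplanes` shows how random ones might be drawn instead.

```python
import random
from collections import defaultdict
from itertools import combinations

def make_hyperplanes(n_planes, dim, rng):
    """Random hyperplanes through the origin, each given by its normal vector."""
    return [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(n_planes)]

def bin_of(point, planes):
    """One bit per plane: 1 if the dot product is non-negative, else 0."""
    return tuple(
        1 if sum(w * x for w, x in zip(plane, point)) >= 0 else 0
        for plane in planes
    )

def build_table(points, planes):
    """Dictionary from bin key to the list of point indices in that bin."""
    table = defaultdict(list)
    for idx, p in enumerate(points):
        table[bin_of(p, planes)].append(idx)
    return table

def nearby_bins(key, max_flips):
    """Yield the bin itself, then bins differing in 1, 2, ... up to max_flips bits."""
    for n_flips in range(max_flips + 1):
        for positions in combinations(range(len(key)), n_flips):
            flipped = list(key)
            for i in positions:
                flipped[i] = 1 - flipped[i]
            yield tuple(flipped)

def candidates(query, planes, table, max_flips=1):
    """Indices of points in the query's bin and its nearby bins."""
    found = []
    for b in nearby_bins(bin_of(query, planes), max_flips):
        found.extend(table.get(b, []))
    return found

planes = [[3.0, 4.0], [1.0, -1.0], [-2.0, 1.0]]  # hand-picked for the example
points = [(1.0, 1.0), (1.2, 0.9), (-3.0, -1.0), (-2.5, -1.5)]
table = build_table(points, planes)
same_bin = candidates((1.1, 1.0), planes, table, max_flips=0)
random_planes = make_hyperplanes(3, 2, random.Random(0))
```

The candidates returned this way would still be ranked by their true distance to the query; increasing `max_flips` trades more computation for a better chance of finding the actual nearest neighbor.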

References

[1] https://stackoverflow.com/questions/5254838/calculating-distance-between-a-point-and-a-rectangular-box-nearest-point

[2] https://gamedev.stackexchange.com/questions/44483/how-do-i-calculate-distance-between-a-point-and-an-axis-aligned-rectangle

Overfitting in Decision Trees

October 3, 2018 | May 21, 2023 | Archit Vora

We had previously written about classification trees and regression trees.

In this post we will mainly write about preventing overfitting in decision trees.

Handling missing values for decision trees is described here.

Over-fitting

For decision trees, training error always goes toward zero as depth increases.
Occam’s razor philosophy
- Suppose there exist two explanations for an occurrence
- Then the less complex explanation is generally the correct one
- Say you have a headache and a stomachache.
- Disease D1 explains the headache, D2 explains the stomachache, while D3 explains both
- Instead of diagnosing the patient with both D1 and D2, diagnosing D3 would generally be more accurate
So when you have two trees with the same validation error, choose the one with lower complexity
There are two approaches:
- Early Stopping
- Pruning

Early Stopping

There can be three stopping conditions:

Predefine max_depth
1. It can be tuned via cross-validation, which increases training time
2. It does not work well when you have little data
Stop if the decrease in classification error is less than some threshold
1. This can be tricky, as in the case of XOR, where no single split reduces the error but two successive splits do
2. XOR is a very famous case in the classification literature
Stop if very few points are left in the node
1. This works pretty well in practice
2. Stop if the node has, say, 10 to 100 samples

Pruning

Pruning is generally a better solution than early stopping
Here we build the entire tree first and then remove certain nodes
Criterion for removing nodes:
- cost(tree) = error(tree) + λ*L(tree)
- L(tree) = number of leaf nodes
If cost(deep_tree) > cost(small_tree), then prune that split (replace the subtree with a leaf)
Algorithm:
- Start at the bottom of the tree and move up, testing one node at a time
- Calculate the cost with and without the split at that node
- Decide whether to prune it or not
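The pruning decision at a single node comes down to comparing two costs. The sketch below uses made-up error rates and leaf counts, and the function names are assumptions; it only illustrates how λ shifts the decision.

```python
def tree_cost(error, n_leaves, lam):
    """cost(tree) = error(tree) + lambda * number of leaf nodes."""
    return error + lam * n_leaves

def should_prune(error_with_split, leaves_with_split,
                 error_without_split, leaves_without_split, lam):
    """Prune when the deeper tree costs more than the smaller one."""
    deep = tree_cost(error_with_split, leaves_with_split, lam)
    small = tree_cost(error_without_split, leaves_without_split, lam)
    return deep > small

# Made-up numbers: the split lowers the error from 0.27 to 0.25
# but raises the leaf count from 5 to 6.
prune_with_large_lambda = should_prune(0.25, 6, 0.27, 5, lam=0.03)  # 0.43 vs 0.42
prune_with_small_lambda = should_prune(0.25, 6, 0.27, 5, lam=0.01)  # 0.31 vs 0.32
```

A larger λ penalizes complexity more heavily, so the same split is pruned at λ = 0.03 and kept at λ = 0.01. In practice λ would be tuned on validation data.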

References:

Course by University of Washington

https://www.coursera.org/learn/ml-classification

Handling missing values in Decision Tree

October 3, 2018 | May 21, 2023 | Archit Vora

We had previously written about classification trees and regression trees.

In this post we will mainly write about handling missing values specifically for decision trees.

Handling overfitting for decision trees is described here.

Missing Values

Missing values can cause issues at:

Training time
Prediction time
1. What if a sample does not have a value for some feature at prediction time?

Purification by Skipping

This approach involves skipping rows or entire features that contain missing values.
However, this method comes with a drawback as it results in the loss of valuable information.
It is important to note that skipping missing values during training will not resolve the issue at prediction time.

Imputation

Imputation refers to the process of filling in missing values with estimated or imputed values.
It provides a means to address missing values during both training and prediction.
A dictionary can be created to store imputed values for each feature, which can then be utilized during prediction.
For categorical features, a common imputation method is to use the mode, while for numeric features, the average or median is often employed.
Advanced techniques, such as the expectation maximization algorithm, can also be utilized for imputation.
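The dictionary of imputed values mentioned above might be built as in this sketch, using the mode for categorical features and the median for numeric ones. The function names, the use of `None` to mark a missing value, and the sample rows are all assumptions made for illustration.

```python
from statistics import median, mode

def fit_imputer(rows, categorical):
    """rows: list of dicts where None marks a missing value.
    Returns {feature: fill value} learned from the training rows."""
    fill = {}
    features = {f for row in rows for f in row}
    for f in features:
        observed = [row[f] for row in rows if row.get(f) is not None]
        fill[f] = mode(observed) if f in categorical else median(observed)
    return fill

def impute(row, fill):
    """Replace missing values in one row using the stored fill values."""
    return {f: (row.get(f) if row.get(f) is not None else v) for f, v in fill.items()}

# Made-up training rows.
train = [
    {"age": 30, "state": "WA"},
    {"age": None, "state": "CA"},
    {"age": 40, "state": "CA"},
    {"age": 50, "state": None},
]
fill = fit_imputer(train, categorical={"state"})
# The same dictionary is reused at prediction time.
completed = impute({"age": None, "state": "WA"}, fill)
```

This sketch assumes every feature has at least one observed value in the training rows; a feature that is always missing would need separate handling.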

However, it’s essential to be aware that imputation can introduce systematic biases into the data. For example, let’s consider a scenario where people from Washington tend to not provide their age, while the rest of the United States does. This could lead to a systematic bias in the imputed values, as the imputation process may assign ages based on data from other regions.

Algo specific technique

In decision tree algorithms, there are specific techniques to handle missing values. At each node in the decision tree, we need to determine which branch a sample with a missing value will take. A missing value for the same feature can be sent down different branches at different nodes. To address this, we can make a tweak in the decision tree algorithm:

During feature selection, consider assigning the missing value to each possible branch.
Evaluate the effect of assigning the missing value to each branch by measuring the reduction in classification error.
Choose the feature and branch that result in the highest reduction in classification error when the missing value is assigned.

However, what if we encounter a node during prediction where we didn’t have any missing values during training? In such cases, we can store the direction with the most number of samples at that node as the default direction for missing values. Let’s call this direction “max_sample.”

If we encounter missing values while training the decision tree, we can override the default “max_sample” direction with the observed direction for the missing values.

By incorporating these techniques, we can effectively handle missing values in decision trees and make informed decisions based on the available data.

References:

Course by University of Washington

https://www.coursera.org/learn/ml-classification

On Classification Accuracy – 2

October 1, 2018 | October 25, 2020 | Archit Vora

We have already talked about it in this post. I just want to add a few more things after finishing a course. This post is an extension of the above with some practical considerations.

We claim that accuracy may not always be a good measure. When you are building an automated machine learning system, you must be able to trust it.

Case Study

You want to show positive reviews on your website.
Say in your dataset 90% reviews are negative.
A classifier can achieve 90% accuracy by predicting all of them as negative.
But what you are interested in is finding the remaining 10% and displaying them on your website.

Precision = Of the reviews I showed, how many were actually positive? (Did I show something negative?)

Recall = How good am I at finding positive reviews?

Analogy with Optimist and Pessimist

The optimist labels every (or almost every) review as positive
- Very good recall, but low precision
The pessimist labels every (or almost every) review as negative
- Bad recall, good precision

Trade-off

The trade-off comes at scoring time, not at training time
We can assign labels based on predicted probabilities
A decision tree gives a probability from the number of positive and negative samples at the leaf node
Logistic regression, of course, gives a probability directly
We can change the threshold to trade off between precision and recall
Positive only when prob > a threshold close to 1 => Pessimist
Positive when prob > a threshold close to 0 => Optimist
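The effect of the threshold can be seen on a handful of scored reviews. The probabilities and labels below are invented for the example, and `precision_recall` is an illustrative helper rather than a library function.

```python
def precision_recall(probabilities, labels, threshold):
    """Label a sample positive when its probability exceeds the threshold."""
    predicted = [p > threshold for p in probabilities]
    tp = sum(1 for pred, y in zip(predicted, labels) if pred and y == 1)
    fp = sum(1 for pred, y in zip(predicted, labels) if pred and y == 0)
    fn = sum(1 for pred, y in zip(predicted, labels) if not pred and y == 1)
    # Convention assumed here: precision is 1.0 when nothing is predicted positive.
    precision = tp / (tp + fp) if tp + fp else 1.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# Made-up scores from some classifier, with the true labels (1 = positive review).
probs = [0.95, 0.85, 0.7, 0.6, 0.4, 0.3, 0.2, 0.1]
labels = [1, 1, 0, 1, 0, 1, 0, 0]

optimist = precision_recall(probs, labels, threshold=0.0)   # everything positive
balanced = precision_recall(probs, labels, threshold=0.5)
pessimist = precision_recall(probs, labels, threshold=0.8)  # only very confident ones
```

The model is trained once; only the threshold applied to its scores changes between the three calls.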

A single number is not always useful

I am not a great fan of single numbers like F1 score and AUC
You cannot always choose a classifier just by AUC, because the ROC curves might intersect
- An intersection means that one classifier is better over some range and the other is better elsewhere
- But if they don’t intersect, we choose the one with higher AUC
From a business perspective, we should be clear about whether we want more precision or more recall
Another practical metric they talked about was precision at k
- Say I want to display 5 reviews on my website
- What is the precision among the 5 reviews I have chosen?
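Precision at k only looks at the k highest-scoring items. A minimal sketch, with invented scores and labels and an illustrative function name:

```python
def precision_at_k(probabilities, labels, k):
    """Fraction of truly positive items among the k highest-scoring ones."""
    ranked = sorted(zip(probabilities, labels), key=lambda pair: pair[0], reverse=True)
    top_k = ranked[:k]
    return sum(y for _, y in top_k) / k

# Made-up scores and true labels (1 = positive review).
probs = [0.95, 0.85, 0.7, 0.6, 0.4, 0.3, 0.2, 0.1]
labels = [1, 1, 0, 1, 0, 1, 0, 0]

p_at_5 = precision_at_k(probs, labels, k=5)  # 3 of the top 5 are positive
p_at_2 = precision_at_k(probs, labels, k=2)
```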

Step Size in Descent Methods

September 20, 2018 | Archit Vora

Descent Method

When the descent direction is opposite to the gradient, it is called gradient descent.
We also have steepest descent and Newton’s method
In this post we will focus on line search
The term ‘line search’ is used because the step size t determines where along the line {x + tΔx} the next iterate will be, where Δx is the descent direction (‘ray search’ would be a more appropriate term, since t ≥ 0)

Fixed Step Size

Here we keep the step size (t) constant for all iterations
The solution may not converge if t is too large
It can take a lot of time if t is too small

Exact line search

It is used when the cost of the minimization problem with one variable (the step size t) is low compared to the cost of computing the search direction itself

Backtracking Line Search

Exact line search has to solve an optimization problem at every iteration, which can become computationally inefficient
Backtracking line search is the most widely used in practice.
Here we don’t find the optimal value of t, but an approximation that decreases the function sufficiently
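A common form of backtracking starts at t = 1 and shrinks t by a factor β until the sufficient-decrease condition f(x + tΔx) ≤ f(x) + α·t·∇f(x)ᵀΔx holds. The sketch below applies it to gradient descent on a simple quadratic; the values α = 0.3 and β = 0.8, the test function, and the function names are assumptions for the example.

```python
def backtracking_step(f, grad, x, alpha=0.3, beta=0.8):
    """Shrink t from 1 until the sufficient-decrease condition is satisfied."""
    g = grad(x)
    direction = [-gi for gi in g]                          # gradient descent direction
    slope = sum(gi * di for gi, di in zip(g, direction))   # grad^T direction, <= 0
    fx = f(x)
    t = 1.0
    while f([xi + t * di for xi, di in zip(x, direction)]) > fx + alpha * t * slope:
        t *= beta
    return t

def gradient_descent(f, grad, x, iterations=100):
    """Gradient descent where each step size comes from backtracking."""
    for _ in range(iterations):
        t = backtracking_step(f, grad, x)
        g = grad(x)
        x = [xi - t * gi for xi, gi in zip(x, g)]
    return x

# Simple ill-conditioned quadratic (assumed): minimum at the origin.
f = lambda x: x[0] ** 2 + 10 * x[1] ** 2
grad = lambda x: [2 * x[0], 20 * x[1]]

start = [5.0, 1.0]
first_step = backtracking_step(f, grad, start)
x_min = gradient_descent(f, grad, start)
```

Each trial only needs one extra function evaluation, which is why this is cheaper than solving the one-dimensional problem exactly.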

Demo of Convergence

References

Youtube series on unconstrained minimization : https://www.youtube.com/watch?v=-kwZhTPAhIQ
Wonderful lecture series by IISC professor Shirish Shevade : https://nptel.ac.in/courses/106108056/
Evergreen book : Convex optimization by Boyd

knn and kernel regression

September 20, 2018 | May 21, 2023 | Archit Vora

This post summarizes what I learned from a course by the University of Washington hosted on Coursera.

Parametric vs Non parametric

Parametric models have a predefined complexity, meaning the complexity is fixed regardless of the number of observations. On the other hand, non-parametric models allow the complexity to grow as the number of observations increases.

Infinite Noiseless Data

When dealing with infinite noiseless data, it is important to note that quadratic fit introduces some bias. However, 1-NN (nearest neighbor) can achieve zero RMSE (root mean squared error).

Examples of Non-parametric Models

Non-parametric models include kNN (k-nearest neighbors), kernel regression, spline, and trees. These models do not make strong assumptions about the underlying data distribution and can adapt to varying complexities based on the observed data.

1 NN

In the nearest neighbor (NN) prediction, we identify the closest data point to the query point, and the response of that nearest point is considered as our prediction.

Voronoi Tessellation

When working with multidimensional data, plotting the nearest neighbor prediction results in a Voronoi tessellation, also known as a Voronoi diagram. This diagram divides the space into regions, each containing the query points that are closest to one particular data point.

Distance Metrics

Distance metrics play a crucial role in nearest neighbor algorithms. Some commonly used distance metrics include:

Euclidean distance: Measures the straight-line distance between two points in the feature space.
Scaled Euclidean distance: Allows for different weights on different dimensions, which is useful when certain features carry more importance. For example, in predicting house prices, the square footage may be weighted more heavily than the number of floors.
Other examples of distance metrics include Mahalanobis distance, rank-based distance, correlation-based distance, cosine similarity, Manhattan distance, and Hamming distance.

1 NN in Practice

The 1 NN algorithm performs well when the data is dense. However, there are limitations to its effectiveness:

In non-dense data regions, it struggles with interpolating between observations.
It is sensitive to noise, as a single noisy data point can significantly impact the prediction.
1 NN tends to overfit the training data, resulting in poor generalization to new observations.

To address these limitations, the k-nearest neighbors (kNN) algorithm is often employed.

k-Nearest Neighbors (kNN)

In the kNN algorithm, we consider the k nearest neighbors and average their responses to form the prediction. This approach helps reduce the impact of noise compared to using only the nearest neighbor (1NN).

Challenges:

Boundary issues: When dealing with boundaries, the same points may repeatedly appear as nearest neighbors, resulting in a flat response.
Sparse region issues: In sparse regions, the same points may also be repeatedly chosen as neighbors, leading to potential inaccuracies.
Discontinuity of fit: The fit obtained using kNN may not be smooth since one neighbor can suddenly be excluded from the set of nearest neighbors.

To address these challenges, we can employ weighted kNN, which assigns different weights to each neighbor based on their proximity.

Weighted k-Nearest Neighbors (kNN):

In weighted kNN, less weight is assigned to neighbors that are farther away from the query point. This approach helps mitigate the issue of fit discontinuity in standard kNN.

There are two common weighting schemes used in weighted kNN:

Simple weighting scheme: In this scheme, the weight assigned to each neighbor is inversely proportional to its distance from the query point. The formula for weight calculation is: weight = 1 / distance.
Sophisticated weighting scheme using kernels: Kernels, such as the Gaussian kernel, are employed to assign weights. The Gaussian kernel never reaches zero, ensuring that even distant neighbors have some influence. On the other hand, kernels like the uniform or triangular kernel eventually reach zero, removing the influence of distant points. The parameter λ controls how quickly the kernel decays. A faster decay means that distant points have less influence.

By incorporating weighting schemes in kNN, we can achieve a more nuanced and smoother fit, addressing the issue of fit discontinuity encountered in standard kNN.
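The simple 1 / distance scheme can be sketched for one-dimensional inputs as below. The function name and the data are illustrative assumptions, and an exact match with a training point is handled separately because its weight would be infinite.

```python
def weighted_knn_predict(query, xs, ys, k):
    """Weighted average of the k nearest responses, with weight = 1 / distance."""
    nearest = sorted(zip(xs, ys), key=lambda pair: abs(pair[0] - query))[:k]
    for x, y in nearest:
        if x == query:
            return y  # exact match: 1 / distance would be infinite
    weights = [1.0 / abs(x - query) for x, _ in nearest]
    return sum(w * y for w, (_, y) in zip(weights, nearest)) / sum(weights)

# Made-up training data.
xs = [1.0, 2.0, 3.0, 10.0]
ys = [1.0, 2.0, 3.0, 10.0]

prediction = weighted_knn_predict(2.25, xs, ys, k=2)  # closer to x=2 than to x=3
```

With k = 2 the prediction moves smoothly between the two neighbors' responses as the query moves, instead of jumping to their plain average.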

Kernel Regression:

Kernel regression is an alternative approach to kNN, where instead of considering only k neighbors, we consider all observations in the dataset. This allows for a more continuous and smoother prediction.

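The kernel in kernel regression can be either unbounded, such as the Gaussian kernel, or bounded, such as the uniform or triangular kernel. With a bounded kernel only a subset of neighbors receives nonzero weight, but it is still not strictly a kNN approach, because the number of such neighbors is not fixed.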

When performing kernel regression, two decisions need to be made:

Choice of kernel: The choice of kernel has a relatively smaller impact on the prediction. Various kernels can be used, and their selection is typically based on the specific problem at hand.
Choice of bandwidth: The bandwidth plays a more significant role in the prediction. It determines the spread of the kernel before it reaches zero. A small bandwidth can lead to overfitting, capturing noise and local fluctuations, while a large bandwidth can result in an overly smoothed fit, leading to high bias. The bandwidth can be tuned using the kernel’s parameter λ, which can be selected through techniques such as cross-validation.

By considering all observations and utilizing kernels in kernel regression, we can achieve a more flexible and continuous prediction model, providing a trade-off between bias and variance based on the choice of bandwidth.
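The role of the bandwidth can be seen in a small sketch of kernel regression with a Gaussian kernel. The kernel form exp(-d²/λ), the function names, and the data are assumptions made for the example.

```python
import math

def gaussian_kernel(distance, bandwidth):
    """Weight that decays with distance; bandwidth (lambda) controls how fast."""
    return math.exp(-(distance ** 2) / bandwidth)

def kernel_regression(query, xs, ys, bandwidth):
    """Weighted average of all responses, weighted by the kernel of the distance."""
    weights = [gaussian_kernel(abs(query - x), bandwidth) for x in xs]
    return sum(w * y for w, y in zip(weights, ys)) / sum(weights)

# Made-up data following y = x^2.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [0.0, 1.0, 4.0, 9.0, 16.0]

narrow = kernel_regression(2.0, xs, ys, bandwidth=0.01)  # follows the nearest point
wide = kernel_regression(2.0, xs, ys, bandwidth=1e6)     # close to the global mean
```

A very small bandwidth reproduces the nearest observation (low bias, high variance), while a very large one approaches the global average of 6 (high bias, low variance). With a tiny bandwidth and a query far from every observation, all weights can underflow to zero, so a practical version would need to guard the division.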

Local Linear Regression

Up until now, we have discussed the use of weighted averages for prediction. However, an alternative approach is to fit a model in the vicinity of the prediction point, where the errors are weighted by a kernel.

In local linear regression, we can fit either a linear model or a quadratic model near the prediction point. A linear model is particularly useful for addressing boundary effects, as it provides a linear prediction instead of a constant one.

On the other hand, a quadratic fit can handle curvature in the data but may introduce higher variance in the predictions. Consequently, in practice, a linear fit is often recommended due to its favorable balance between bias and variance.

By employing local linear regression, we can adapt the model’s behavior based on the local data characteristics, enabling more accurate predictions near the prediction point.

Global vs Local Fit

In certain situations, we may find that a linear fit is appropriate for some regions of the input space, while a quadratic fit is more suitable for others. However, determining the breakpoints where the underlying structure changes can be challenging.

Non-parametric models come to the rescue in such cases. Instead of assuming a specific functional form, these models allow for more flexibility in capturing the underlying patterns.

One example of a global fit is taking the average as a constant prediction. This approach provides a simple representation of the data but assumes a uniform behavior across all observations.

On the other hand, kernel regression offers a way to achieve a local fit by applying different weights to each observation. Nearer observations receive higher weights, enabling the model to adapt to local variations in the data.

By leveraging non-parametric models like kernel regression, we can capture both global and local characteristics of the data, allowing for more accurate and adaptive predictions based on the proximity of observations.

Limitations of Non-parametric Approaches

While non-parametric approaches like kNN offer flexibility and adaptability, they do have certain limitations.

Dimensionality: In higher dimensions, non-parametric models require an exponentially large number of observations to accurately capture the underlying patterns. As the number of dimensions increases, the data becomes more sparse, making it challenging to find sufficient neighboring points for accurate predictions.
Data Availability: Non-parametric models heavily rely on the availability of a large amount of data. When the dataset is limited or scarce, it becomes difficult to leverage the full potential of non-parametric approaches.
Computational Complexity: Non-parametric models, such as kNN, involve a brute force search to find the nearest neighbors. This search operation has a complexity of O(N) for 1-NN and O(N log K) for k-NN, where N is the number of observations. While these complexities can be manageable for smaller datasets, they can become computationally expensive as the dataset grows larger.

To mitigate some of these limitations, techniques like clustering can be employed to reduce the search space and improve computational efficiency.

In situations where the dataset is limited or high-dimensional, parametric models offer an alternative by making certain assumptions about the underlying structure. These models provide a more compact representation and are often more suitable when data availability or computational complexity is a concern.

Understanding the limitations of non-parametric approaches helps in selecting the most appropriate modeling technique based on the specific characteristics of the dataset and the available resources.

References

Course by University of Washington

https://www.coursera.org/learn/ml-regression

Generalized Linear Models (GLM)

September 18, 2018 | October 25, 2020 | Archit Vora

In standard linear regression we make two assumptions:

P(Y|X) is a normal distribution
The mean is a linear function of the input: µ = β*X
P(Y|X) = Ν(µ, σ^2 * I) # σ is the standard deviation and I is the identity matrix

In GLM we relax both:

P(Y|X) can be from any exponential family
The mean is some function of β*X
1. µ = f(β*X)
2. g(µ) = β*X
3. g = f^(-1)
4. g is called the link function

Examples of link functions:

log link
reciprocal link
logit link

The derivation of the log-likelihood follows the same steps as for the normal distribution. However, a closed-form solution generally does not exist, so the parameters are found numerically, for example by iteratively reweighted least squares or other convex optimization methods.

Here is one example from the MIT course mentioned in the references.

[Figure: Poisson example]

Logistic

In Gaussian regression we predict μ for each sample
- This μ comes from β0, β1, β2, which are the same for every sample
For binomial regression we want to predict p for each sample
- This p comes from β0, β1, β2, which are the same for every sample
Option one:
- p = β0 + β1*x1 + β2*x2
Option two:
- p = sigmoid(β0 + β1*x1 + β2*x2)
- f(p) = log(p/(1-p)) = β0 + β1*x1 + β2*x2
- This is the logit link function
What are the other options apart from sigmoid?
- Step function (not differentiable, which is why we use sigmoid)
- tanh is sometimes used in deep learning
What if we go with option one?
- The binomial distribution requires p to be in (0,1), but a linear function can produce any real value
Example:
- How many fish survive (alive/dead) given food and water

Poisson

The Poisson distribution models the probability of observing a count k
- P(k) = exp(-λ) * (λ^k) / k!
Parameter λ > 0
Option one:
- λ = β0 + β1*x1 + β2*x2
Option two:
- λ = exp(β0 + β1*x1 + β2*x2)
- f(λ) = log(λ) = β0 + β1*x1 + β2*x2
- This is the log link function
What if we go with option one?
- We want λ > 0, but a linear function can be negative
- The relationship between input and output is multiplicative rather than additive
  - Suppose enough water makes 1.5 times as many seeds germinate, and enough fertilizer makes 1.2 times as many germinate. When you give both enough water and enough fertilizer, would 1.5 + 1.2 = 2.7 times as many seeds germinate?
    Of course not. The estimated value would be 1.5 * 1.2 = 1.8 times. [3]
Example:
- How many seeds will germinate given water and fertilizer
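The multiplicative behavior of the log link can be checked numerically. In the sketch below the coefficients are chosen by hand (not fitted) so that water multiplies the expected count by 1.5 and fertilizer by 1.2; the baseline of 10 seeds and the function name are assumptions for the example.

```python
import math

def expected_count(b0, b_water, b_fertilizer, water, fertilizer):
    """Poisson mean under the log link: lambda = exp(b0 + b1*x1 + b2*x2)."""
    return math.exp(b0 + b_water * water + b_fertilizer * fertilizer)

# Hand-picked coefficients: baseline of 10 seeds, water x1.5, fertilizer x1.2.
b0, b_water, b_fertilizer = math.log(10), math.log(1.5), math.log(1.2)

baseline = expected_count(b0, b_water, b_fertilizer, water=0, fertilizer=0)
water_only = expected_count(b0, b_water, b_fertilizer, water=1, fertilizer=0)
both = expected_count(b0, b_water, b_fertilizer, water=1, fertilizer=1)
ratio_both = both / baseline  # effects multiply: 1.5 * 1.2
```

Adding terms inside the exponential is the same as multiplying their effects outside it, which is why the log link matches count data of this kind.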

Parameter Estimation

We can use maximum likelihood estimation to find the parameters β0, β1, β2
Deriving the likelihood for binomial:
- likelihood = Product (likelihood of y_actual of each sample under the predicted distribution)
- likelihood = Product ( Binomial(p) )
- likelihood = Product ( p if y=1 else (1-p) )
- log(likelihood) = Summation ( y*log(p) + (1-y)*log(1-p) )
Deriving the likelihood for Poisson:
- likelihood = Product (likelihood of y_actual of each sample under the predicted distribution)
- likelihood = Product ( Poisson(λ) )
- likelihood = Product ( exp(-λ) * λ^y / y! )
- log(likelihood) = Summation ( -λ + y*log(λ) - log(y!) )
The two derivations above are rough but convey the idea
For Gaussian it turns out to be OLS (Ordinary Least Squares), which has a closed-form solution
For the others we maximize it via gradient or Newton’s method.
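The two log-likelihoods can be written directly as functions of β. This is a sketch under the assumptions already made in this post (sigmoid for the binomial case, exp for the Poisson case); the function names and the tiny dataset are illustrative, and an optimizer would then search for the β that maximizes these values.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def linear_predictor(beta, x):
    """beta[0] + beta[1]*x[0] + beta[2]*x[1] + ..."""
    return beta[0] + sum(b * xi for b, xi in zip(beta[1:], x))

def bernoulli_log_likelihood(beta, X, y):
    """Sum over samples of y*log(p) + (1-y)*log(1-p), with p = sigmoid(beta . x)."""
    total = 0.0
    for xi, yi in zip(X, y):
        p = sigmoid(linear_predictor(beta, xi))
        total += yi * math.log(p) + (1 - yi) * math.log(1 - p)
    return total

def poisson_log_likelihood(beta, X, y):
    """Sum over samples of -lam + y*log(lam) - log(y!), with lam = exp(beta . x)."""
    total = 0.0
    for xi, yi in zip(X, y):
        lam = math.exp(linear_predictor(beta, xi))
        total += -lam + yi * math.log(lam) - math.lgamma(yi + 1)  # lgamma(y+1) = log(y!)
    return total

# Tiny made-up dataset with one feature.
X = [[1.0], [2.0]]
y = [1, 0]
ll_bernoulli = bernoulli_log_likelihood([0.0, 0.0], X, y)  # p = 0.5 for both samples
ll_poisson = poisson_log_likelihood([0.0, 0.0], X, y)      # lam = 1 for both samples
```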

References :

[0] https://ocw.mit.edu/courses/mathematics/18-650-statistics-for-applications-fall-2016/lecture-slides/MIT18_650F16_GLM.pdf

[1] Wonderful MIT lecture : https://www.youtube.com/watch?v=X-ix97pw0xY

[2] https://onlinecourses.science.psu.edu/stat504/node/216/

[3] https://tsmatz.wordpress.com/2017/08/30/glm-regression-logistic-poisson-gaussian-gamma-tutorial-with-r/

Exponential Family

September 18, 2018 | Archit Vora

Here is the basic concept:

[Figure: simple exponential form]

θ are the parameters and X is the data; both can be multidimensional
We want to restrict the terms inside the exponential to the form θ*X

Formal Definition:

[Figure: formal definition]

The η and T functions also help when there is a mismatch between the dimensions of θ and X.
g(θ) in the basic concept above has been moved inside the exponential as B(θ).
- It roughly serves as a normalization factor.
h(x) serves as a base distribution, and the exponential term transforms this base distribution
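Since the original figure is not reproduced here, the block below writes out the standard textbook form of an exponential family using the same symbols (η, T, B, h). It is a reconstruction under that assumption, not a copy of the figure, followed by the Poisson distribution as a worked instance.

```latex
% General form: parameters enter only through \eta(\theta), data only through T(x) and h(x)
p_\theta(x) = h(x)\,\exp\!\big(\eta(\theta)^{\top} T(x) - B(\theta)\big)

% Poisson with rate \lambda, for k = 0, 1, 2, \dots
P(k) = \frac{e^{-\lambda}\lambda^{k}}{k!}
     = \underbrace{\frac{1}{k!}}_{h(k)}
       \exp\!\big(\underbrace{k}_{T(k)}\,\underbrace{\log\lambda}_{\eta(\lambda)}
       - \underbrace{\lambda}_{B(\lambda)}\big)
```

Here B(λ) = λ is exactly the term that makes the probabilities sum to one, which is the normalization role described above.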

[Figure: examples]

Subgradient Methods

September 17, 2018 | May 7, 2020 | Archit Vora

[Figure: subgradient]

Subgradient methods are used for non-differentiable functions
Unlike gradient descent, this is not a descent method: the function value can often increase as well
- To combat this, we keep track of the best point found so far
Also, the step size needs to be predefined
- The line search option of gradient descent does not work here
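A minimal sketch of the method on a one-dimensional Lasso-style problem, minimizing 0.5·(x − a)² + λ·|x|. The problem, the diminishing step size 1/k, and the function names are assumptions chosen for illustration; for a = 3 and λ = 1 the known minimizer is x = 2.

```python
def sign(v):
    """-1, 0 or 1; the value 0 at v = 0 is a valid subgradient of |v| there."""
    return (v > 0) - (v < 0)

def subgradient_method(f, subgrad, x0, iterations=2000):
    """Fixed step-size schedule, keeping track of the best point found so far."""
    x = x0
    best_x, best_f = x0, f(x0)
    for k in range(1, iterations + 1):
        step = 1.0 / k               # predefined diminishing step size
        x = x - step * subgrad(x)
        fx = f(x)
        if fx < best_f:              # the value may go up, so remember the best
            best_x, best_f = x, fx
    return best_x, best_f

# One-dimensional Lasso-style objective (assumed example).
a, lam = 3.0, 1.0
f = lambda x: 0.5 * (x - a) ** 2 + lam * abs(x)
subgrad = lambda x: (x - a) + lam * sign(x)

best_x, best_f = subgradient_method(f, subgrad, x0=0.0)
```

The step sizes are fixed before the run rather than chosen by line search, which matches the restriction noted above.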

One application is in Lasso.

[Figure: Lasso subgradient]

References

subgrad_method.pdf
