K-Means Clustering

This post is a lecture summary of a course by the University of Washington available on Coursera: Machine Learning: Clustering and Retrieval.

Application

We want to discover groups of articles (e.g., sports, world news) without knowing the labels in advance; labels can be assigned after the clusters are found. The goal is to learn a user's preferences, for example whether they like sports or technology.

Users provide feedback on whether they liked an article or not. We can accumulate this feedback within specific clusters to understand preferences.

Contrast this with building a classification model that predicts whether a user will like a specific article. For that, we would need to build a classifier for each user and score each (article, user) pair.

Alternatively, we can determine the types of articles each user prefers and show nearby articles using K-nearest neighbors (KNN). Clustering offers a simpler and more elegant approach.

Unsupervised Task

Clustering is an unsupervised task since there are no input-output labels. Unsupervised learning poses challenges:

  • Defining clusters can be easy or hard, depending on the case.
  • Our algorithm works well for cases with intermediate difficulty.
  • There are still unsolved clustering problems.
[Figure: examples of unsolved clustering problems]

Clustering in General

A cluster is defined by its center or centroid and its shape, often represented by an ellipsoid.

K-Means

Scoring in K-means involves assigning a point to a given cluster. The scoring mechanism is the distance to the cluster center.

K-means alternates between two steps:

  1. Assign data to clusters.
  2. Update cluster centers.
[Figures: step 1 (assign data to clusters) and step 2 (update cluster centers)]

Both steps are minimization steps, which makes K-means a form of coordinate descent. We alternate between minimizing over two sets of variables: z (the cluster assignments) and μ (the cluster centers).
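The two alternating steps can be sketched in plain Python. This is a minimal illustration, not the course's implementation: the function names, the tuple representation of points, and the choice to keep an empty cluster's old center are assumptions made here.

```python
import math

def assign_clusters(points, centers):
    """Step 1: give each point the index of its closest center."""
    return [min(range(len(centers)), key=lambda j: math.dist(p, centers[j]))
            for p in points]

def update_centers(points, assignments, centers):
    """Step 2: move each center to the mean of the points assigned to it."""
    new_centers = []
    for j, old in enumerate(centers):
        members = [p for p, a in zip(points, assignments) if a == j]
        if not members:
            # Assumption: an empty cluster keeps its previous center.
            new_centers.append(old)
            continue
        new_centers.append(tuple(sum(p[d] for p in members) / len(members)
                                 for d in range(len(old))))
    return new_centers

def kmeans(points, centers, max_iter=100):
    """Alternate the two steps until the assignments stop changing."""
    assignments = None
    for _ in range(max_iter):
        new_assignments = assign_clusters(points, centers)
        if new_assignments == assignments:
            break
        assignments = new_assignments
        centers = update_centers(points, assignments, centers)
    return centers, assignments

points = [(0.0, 0.0), (0.0, 1.0), (10.0, 10.0), (10.0, 11.0)]
centers, labels = kmeans(points, centers=[(0.0, 0.0), (10.0, 10.0)])
```

On this toy data the loop stops on the second pass, once the assignments no longer change.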

Convergence

The K-means objective is non-convex, so finding the global minimum is not guaranteed. We aim to find the assignments zᵢ and centers μⱼ that minimize the objective function.

[Figure: K-means objective, the sum of squared distances from each point to its assigned cluster center]

Coordinate descent converges, but only to a local minimum, and which one it reaches depends on the initialization. To reach better solutions, we use the smart initialization described below.

Smart Initialization

  1. Pick the first center uniformly at random from the data points.
  2. For every data point, calculate the distance to the nearest center chosen so far.
  3. Choose the next center from the data points with probability proportional to the squared distance.
  4. Repeat steps 2 and 3 until K centers have been chosen.

To sample with these probabilities, calculate the cumulative probability as described in [1]. When K-means is run after this initialization, it is known as K-means++.
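The seeding procedure, including the cumulative-sum sampling, can be sketched as follows. The name `kmeans_pp_init` and the `rng` parameter are hypothetical, and the code assumes the points are tuples of numbers.

```python
import bisect
import itertools
import math
import random

def kmeans_pp_init(points, k, rng=None):
    """Pick k initial centers from points, favoring points far from existing centers."""
    rng = rng or random.Random()
    centers = [rng.choice(points)]          # first center: uniform at random
    while len(centers) < k:
        # Squared distance from each point to its nearest chosen center.
        d2 = [min(math.dist(p, c) ** 2 for c in centers) for p in points]
        # Sample an index with probability proportional to d2 via the cumulative sum.
        cumulative = list(itertools.accumulate(d2))
        r = rng.random() * cumulative[-1]
        idx = bisect.bisect_right(cumulative, r)
        # The clamp only matters in degenerate cases (e.g. all distances are zero).
        centers.append(points[min(idx, len(points) - 1)])
    return centers

points = [(0.0, 0.0), (1.0, 0.0), (5.0, 5.0)]
seeds = kmeans_pp_init(points, k=3, rng=random.Random(0))
```

A point that is already a center has squared distance zero, so its slice of the cumulative sum is empty and it cannot be drawn again while other points remain.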

Trade-offs

K-means++ spends more time on initialization, but the subsequent iterations typically converge faster, and usually to a better solution than K-means with purely random initialization.

Assessing Quality

Quality can be assessed by evaluating the objective value after convergence, which is referred to as cluster heterogeneity. A smaller heterogeneity indicates a better clustering. Tighter clusters are more homogeneous.

Beware of overfitting: as K increases, the lowest achievable heterogeneity always decreases, reaching zero when every point is its own cluster.

Choosing K

To choose K, you can heuristically plot heterogeneity for various values of K and select the one at the elbow. Alternatively, you can use hierarchical clustering, which doesn’t require specifying K in advance. We can also perform Silhouette Analysis.

References

[1]: Stack Overflow: How exactly does k-means++ work?

Nearest Neighbour Search

 

This post is a lecture summary of a course by the University of Washington available on Coursera: Machine Learning: Clustering and Retrieval.

Problem: Document Retrieval

Given a document, we want to search for similar documents in a corpus.

Representation: How to Represent Documents as Vectors?

  • Bag of Words: Count the frequency of each word in the vocabulary.
  • TF-IDF: Calculate the IDF (inverse document frequency) of each word in the vocabulary. Weight word frequency (TF) by its IDF in a given document.
    • TF: Same as bag of words.
    • IDF: Higher for rare words, calculated as log(D_total / (1 + D_word)).
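A small sketch of the TF-IDF weighting above, using the same IDF formula log(D_total / (1 + D_word)). The function name and the token-list input format are assumptions for illustration.

```python
import math
from collections import Counter

def tf_idf(corpus):
    """corpus: list of token lists. Returns one {word: weight} dict per document."""
    n_docs = len(corpus)
    # Number of documents each word appears in.
    doc_freq = Counter(word for doc in corpus for word in set(doc))
    idf = {w: math.log(n_docs / (1 + df)) for w, df in doc_freq.items()}
    # TF is the raw count (bag of words), weighted by the word's IDF.
    return [{w: tf * idf[w] for w, tf in Counter(doc).items()} for doc in corpus]

corpus = [
    ["the", "cat", "sat"],
    ["the", "dog", "sat"],
    ["the", "cat", "ran"],
    ["a", "bird"],
]
weights = tf_idf(corpus)
```

Here "the" appears in three of the four documents and gets weight 0, while the rarer "dog" gets a larger weight than "cat".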

Distance Metrics

There are different distance metrics used to measure similarity or dissimilarity between vectors. Let’s explore some commonly used distance metrics:

Euclidean Distance:

  • Euclidean distance between two vectors x and y can be calculated using the vectorized form: sqrt((x-y).T.(x-y)).
  • It is beneficial to normalize features because features can have different units and varying ranges.
  • Normalization can be achieved by dividing either by the range or by the variance of the features.

Scaled Euclidean Distance:

  • Scaled Euclidean distance allows for weighting different dimensions differently based on their importance.
  • For example, in text analysis, we may want to give more weight to the title/abstract than the body of an article.
  • This can be achieved by summing up the word vectors of the article and title and using a diagonal matrix A that contains weights for each feature.
  • The vectorized form for scaled Euclidean distance is: sqrt((x-y).T.A.(x-y)).

Cosine Similarity:

  • Cosine similarity measures the cosine of the angle between two vectors.
  • For unit-length vectors, similarity = x.dot(y).
  • Distance = 1 - similarity.
  • Cosine distance is not a proper distance metric because it does not satisfy the triangle inequality (|ab| + |bc| ≥ |ac|).
  • Cosine similarity is commonly used in Natural Language Processing (NLP) tasks because it is cheap to compute for sparse features.
  • Feature vectors in NLP tend to be sparse.
  • The range of similarity values is generally between -1 and 1. However, when features are non-negative, the range becomes 0 to 1, which is the case with TF-IDF.
  • For vectors that are not unit length, normalize instead of using the raw inner product: Similarity = x.dot(y) / (|x||y|).
  • Normalization makes the similarity insensitive to document length: a document and a copy of it that was doubled in length by pasting the text twice are treated as identical.
  • One side effect of normalization is that long documents and tweets may appear similar, which is undesired when suggesting tweets to someone reading a document.
  • To address this, an alternative approach is to cap the maximum word count that can appear in a vector.

Cosine vs Euclidean:

  • The choice between cosine and Euclidean distance depends on the use case.
  • Cosine distance is more suitable for text features where the magnitude of the vector is not as important as the orientation.
  • Euclidean distance is a better choice for count features, such as measuring how many times a document was read.
  • For mixed feature types, both cosine and Euclidean distances can be computed (possibly with a subset of relevant features) and combined using appropriate weights, considering the trade-offs and requirements of the specific problem.
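The metrics in this section can be written out directly. This is a plain-Python sketch with hypothetical function names; `weights` stands for the diagonal of the matrix A.

```python
import math

def euclidean(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def scaled_euclidean(x, y, weights):
    """weights holds the diagonal entries of A, one per feature."""
    return math.sqrt(sum(w * (a - b) ** 2 for w, a, b in zip(weights, x, y)))

def cosine_similarity(x, y):
    dot = sum(a * b for a, b in zip(x, y))
    return dot / (math.hypot(*x) * math.hypot(*y))

def cosine_distance(x, y):
    return 1.0 - cosine_similarity(x, y)

doc = [1.0, 2.0, 0.0]
doubled = [2.0, 4.0, 0.0]   # the same document pasted twice
```

For `doc` and `doubled`, the cosine distance is zero (up to rounding) while the Euclidean distance is sqrt(5), which illustrates why cosine is preferred when only orientation matters.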

 

Brute Force

  • Calculate the distance to each document in the corpus
  • Suppose there are N documents in the corpus
  • Complexity is O(N) per query for 1-NN
  • It is O(N log k) per query for k-NN
    • Using a priority queue
  • This is computationally inefficient when N is large and we need to query often
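A sketch of the brute-force k-NN query with a priority queue. The function name is hypothetical and Euclidean distance is assumed; any of the metrics above could be substituted.

```python
import heapq
import math

def knn_brute_force(query, corpus, k):
    """Return the k closest (distance, index) pairs, nearest first."""
    heap = []   # max-heap of size k, simulated by storing negated distances
    for i, doc in enumerate(corpus):
        d = math.dist(query, doc)
        if len(heap) < k:
            heapq.heappush(heap, (-d, i))
        elif -d > heap[0][0]:
            # Closer than the current k-th nearest: replace the farthest one.
            heapq.heapreplace(heap, (-d, i))
    return sorted((-neg_d, i) for neg_d, i in heap)

corpus = [(0.0, 0.0), (1.0, 1.0), (5.0, 5.0), (0.5, 0.0)]
nearest = knn_brute_force((0.0, 0.0), corpus, k=2)
```

Each of the N documents costs at most one O(log k) heap operation, which gives the O(N log k) bound.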

KD-Trees

KD-Trees are used to divide the space into hierarchical regions via a tree structure. Here’s how it works:

  • At each node, we store the smallest bounding box that contains all the data points in that region.
  • While querying (for 1 nearest neighbor, which can be extended for k nearest neighbors):
    • First, we find the leaf node where the query point belongs.
    • Then, we find the nearest neighbor within that node.
    • We keep moving up in the tree.
    • If the distance between the query point and the bounding box is greater than the distance found so far, we skip that region and move up the tree.
    • If it is less, we traverse down the tree.
  • To calculate the distance between a point and a bounding box, you can use Euclidean distance or another suitable measure. This is computationally easy because the boxes are axis-aligned [1].
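The point-to-box distance and the resulting pruning test can be sketched as follows. Function and parameter names are hypothetical; the optional `a` corresponds to the more aggressive (approximate) pruning factor and defaults to exact pruning.

```python
import math

def distance_to_box(point, box_min, box_max):
    """Euclidean distance from point to an axis-aligned box (0.0 if the point is inside)."""
    # Per dimension, the gap is how far the point lies outside [lo, hi].
    gaps = [max(lo - p, 0.0, p - hi) for p, lo, hi in zip(point, box_min, box_max)]
    return math.hypot(*gaps)

def can_prune(point, box_min, box_max, best_so_far, a=1.0):
    """True if nothing in the box can beat best_so_far / a."""
    return distance_to_box(point, box_min, box_max) > best_so_far / a
```

For example, the box with corners (1, 1) and (2, 2) is at distance 1.0 from the query (3, 1.5), so it can be skipped once a neighbor closer than 1.0 has been found.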

Construction of KD-Trees:

Here are the steps involved in constructing KD-Trees:

  • Choose a splitting strategy:
    • Widest dimension: Split based on the dimension with the widest range.
    • Alternating dimension: Split alternately between dimensions.
  • Determine the value to split on (split value):
    • Median: Choose the median value of the points along the splitting dimension.
    • Center point: Compute the center point as the average of the left and right extremes.
  • Stop conditions for splitting:
    • If there are fewer points left than a specified threshold.
    • If the minimum width of the region is achieved.

How to Prune More Aggressively:

To prune more aggressively in KD-Trees, consider the following approach:

  • Suppose the nearest distance found is r.
  • Instead of pruning points that are farther than distance r, prune all points that are farther than (r/a), where a > 1.
  • This allows for more aggressive pruning and can speed up the search process.

Limitations of KD-Trees:

KD-Trees face challenges in high-dimensional spaces. Here are some limitations:

  • In high dimensions, the radius of the search space is generally large.
  • For d dimensions, the radius intersects 2^d hypercubes, resulting in an extensive search space.
  • One possible solution is to prune all points that are farther away than (r/a), where a > 1, to mitigate these issues.

 

LSH – Locality Sensitive Hashing

Locality Sensitive Hashing (LSH) is a technique that hashes data points into bins and restricts the search to a given bin. Here's how it works:

  • The idea is to group data points into bins based on their proximity and search within the relevant bin(s).
  • If necessary, nearby bins can also be explored to expand the search.
  • To determine nearby bins, multiple random lines (or planes in higher dimensions) are used.
  • Let’s consider a motivating example with points in a 2D plane:
    • A single random line passing through the origin is chosen.
    • Points that are closer to each other will mostly fall into the same bin.
  • However, using just one line puts many data points in each bin, which increases query time.
  • The solution is to use multiple random lines (or planes), which split the space into more, smaller bins.
  • For example, consider three lines f1, f2, and f3, each of the form f(x,y) = ax + by = 0 (e.g., 3x + 4y = 0):
    • If f1(p,q) < 0, f2(p,q) > 0, and f3(p,q) > 0 for a given point (p,q), we assign it to the bucket [0,1,1].
  • To facilitate searching within bins, a dictionary can be used, where the key represents the bin (e.g., [0,1,1], [0,1,0]), and the value is a list of all points in that bin.
  • To find nearby bins:
    • Flip one bit to find bins with a single bit difference.
    • For more exploration, flip two bits to find bins with two bits difference.
  • It is possible for the nearest point to be in a different bin.
  • The number of neighboring bins to explore depends on factors such as computational budget or a predefined quality threshold.
  • In some cases, an exact nearest neighbor may not be necessary, and a point with a predefined quality (epsilon) is sufficient.
  • LSH is useful in scenarios like document suggestion, where similar documents need to be retrieved efficiently.
  • LSH can also be extended to high-dimensional spaces by using planes or hyperplanes instead of lines.
  • For assigning bins to data points, dot product calculations are performed, which work well even in large dimensions, especially when the feature vectors are sparse.
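The binning and neighbor-bin search can be sketched with a dictionary keyed by bit tuples. All names here are hypothetical, and drawing the plane normals from a Gaussian is one common choice, assumed for illustration.

```python
import random
from collections import defaultdict
from itertools import combinations

def make_planes(n_planes, dim, rng):
    """Random hyperplanes through the origin, one normal vector per plane."""
    return [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(n_planes)]

def bin_of(point, planes):
    """One bit per plane: 1 if the dot product with the normal is non-negative."""
    return tuple(int(sum(w * x for w, x in zip(plane, point)) >= 0) for plane in planes)

def build_index(points, planes):
    index = defaultdict(list)
    for i, p in enumerate(points):
        index[bin_of(p, planes)].append(i)
    return index

def nearby_bins(key, max_flips):
    """Yield key itself, then every bin that differs in up to max_flips bits."""
    yield key
    for n in range(1, max_flips + 1):
        for positions in combinations(range(len(key)), n):
            flipped = list(key)
            for pos in positions:
                flipped[pos] ^= 1
            yield tuple(flipped)

def candidates(query, index, planes, max_flips=1):
    """Indices of all points in the query's bin and its nearby bins."""
    found = []
    for b in nearby_bins(bin_of(query, planes), max_flips):
        found.extend(index.get(b, []))
    return found

planes = [[3.0, 4.0], [1.0, -1.0], [-1.0, 0.0]]   # fixed planes for a reproducible demo
points = [(1.0, 1.0), (1.1, 0.9), (-1.0, -1.0)]
index = build_index(points, planes)
```

The candidates returned are then ranked by true distance; a larger `max_flips` trades query time for a better chance of finding the actual nearest neighbor.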

References

[1] https://stackoverflow.com/questions/5254838/calculating-distance-between-a-point-and-a-rectangular-box-nearest-point

[2] https://gamedev.stackexchange.com/questions/44483/how-do-i-calculate-distance-between-a-point-and-an-axis-aligned-rectangle

Overfitting in Decision Trees

We had previously written about classification trees and regression trees.

In this post we will mainly write about preventing overfitting in decision trees.

Handling missing values for decision trees is described here.

Over-fitting

  • For decision trees, as depth increases, training error always goes towards zero.
  • Occam's razor philosophy
    • Suppose there exist two explanations for an occurrence
    • Then the less complex explanation is generally the correct one
    • Say you have a headache and a stomachache.
    • Disease D1 explains the headache, D2 explains the stomachache, while D3 explains both
    • Instead of diagnosing the patient with both D1 and D2, diagnosing D3 would generally be more accurate
  • So when you have two trees with the same validation error, choose the one with lower complexity
  • There are two approaches:
    • Early Stopping
    • Pruning

Early Stopping

There can be three stopping conditions:

  1. Predefine max_depth
    1. It can be tuned via cross-validation, which increases training time
    2. Does not work well when you have little data
  2. Stop if the decrease in classification error is less than some threshold
    1. This can be tricky, as in the case of XOR, where no single split reduces the error but two successive splits do
    2. XOR is a very famous case in classification literature
  3. Stop if very few points are left in the node
    1. This works pretty well in practice
    2. Stop if the node has, say, 10 to 100 samples

Pruning

  • Pruning is generally a better solution than early stopping
  • Here we build the entire tree first and then remove certain nodes
  • Criterion for removing nodes:
    • cost(tree) = error(tree) + λ*L(tree)
    • L(tree) = number of leaf nodes
  • If cost(deep_tree) > cost(small_tree), then don't split that node
  • Algorithm:
    • Start from the bottom of the tree and traverse up, testing one node at a time
    • Calculate the cost with and without the split
    • Decide whether to prune it or not
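The pruning decision for a single node can be sketched as a cost comparison. The function names and the example numbers are made up for illustration; breaking ties in favor of the smaller tree is an assumption in the spirit of Occam's razor.

```python
def tree_cost(error, n_leaves, lam):
    """cost(tree) = error(tree) + lambda * number of leaf nodes."""
    return error + lam * n_leaves

def should_prune(error_split, leaves_split, error_pruned, leaves_pruned, lam):
    """Prune when the smaller tree costs no more than the tree that keeps the split."""
    return tree_cost(error_pruned, leaves_pruned, lam) <= tree_cost(error_split, leaves_split, lam)

# Keeping the split: error 0.25 with 6 leaves. Pruning it: error 0.26 with 5 leaves.
prune_with_penalty = should_prune(0.25, 6, 0.26, 5, lam=0.03)
prune_without_penalty = should_prune(0.25, 6, 0.26, 5, lam=0.0)
```

With λ = 0.03 the small gain in error does not pay for the extra leaf, so the node is pruned; with λ = 0 the deeper tree always wins.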

References:

Course by University of Washington

https://www.coursera.org/learn/ml-classification

Handling missing values in Decision Tree

We had previously written about classification trees and regression trees.

In this post we will mainly write about handling missing values specifically for decision trees.

Handling overfitting for decision trees is described here.

Missing Values

Missing values can cause issues at:

  1. Training time
  2. Prediction time
    1. What if, at prediction time, a sample does not have a value for some feature?

Purification by Skipping

  • This approach involves skipping rows or entire features that contain missing values.
  • However, this method comes with a drawback as it results in the loss of valuable information.
  • It is important to note that skipping missing values during training will not resolve the issue during prediction time

Imputation

  • Imputation refers to the process of filling in missing values with estimated or imputed values.
  • It provides a means to address missing values during both training and prediction.
  • A dictionary can be created to store imputed values for each feature, which can then be utilized during prediction.
  • For categorical features, a common imputation method is to use the mode, while for numeric features, the average or median is often employed.
  • Advanced techniques, such as the expectation maximization algorithm, can also be utilized for imputation.
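The dictionary of imputed values can be sketched as below: it is fitted once on the training rows and reused at prediction time. The function names, the dict-per-row format, and the use of `None` for missing values are assumptions; the code also assumes every feature has at least one observed value.

```python
from collections import Counter
from statistics import median

def fit_imputer(rows, categorical):
    """rows: list of dicts with None for missing. Returns {feature: fill value}."""
    fill = {}
    features = {f for row in rows for f in row}
    for f in features:
        observed = [row[f] for row in rows if row.get(f) is not None]
        if f in categorical:
            fill[f] = Counter(observed).most_common(1)[0][0]   # mode
        else:
            fill[f] = median(observed)
    return fill

def impute(row, fill):
    """Replace missing values in one row using the stored fill values."""
    return {f: (row.get(f) if row.get(f) is not None else v) for f, v in fill.items()}

train = [
    {"age": 30, "state": "WA"},
    {"age": None, "state": "CA"},
    {"age": 50, "state": "WA"},
]
fill = fit_imputer(train, categorical={"state"})
completed = impute({"age": None, "state": None}, fill)
```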

However, it’s essential to be aware that imputation can introduce systematic biases into the data. For example, let’s consider a scenario where people from Washington tend to not provide their age, while the rest of the United States does. This could lead to a systematic bias in the imputed values, as the imputation process may assign ages based on data from other regions.

Algorithm-Specific Technique

In decision tree algorithms, there are specific techniques to handle missing values. At each node in the decision tree, we need to determine which branch a missing value will take. Missing values of the same feature may be sent down different branches at different nodes. To address this, we can make a tweak in the decision tree algorithm:

  1. During feature selection, consider assigning the missing value to each possible branch.
  2. Evaluate the effect of assigning the missing value to each branch by measuring the reduction in classification error.
  3. Choose the feature and branch that result in the highest reduction in classification error when the missing value is assigned.

However, what if we encounter a node during prediction where we didn’t have any missing values during training? In such cases, we can store the direction with the most number of samples at that node as the default direction for missing values. Let’s call this direction “max_sample.”

If we encounter missing values while training the decision tree, we can override the default “max_sample” direction with the observed direction for the missing values.

By incorporating these techniques, we can effectively handle missing values in decision trees and make informed decisions based on the available data.

References:

Course by University of Washington

https://www.coursera.org/learn/ml-classification

On Classification Accuracy – 2

We have already talked about classification accuracy in this post. After finishing a course, I want to add a few more things; this post is an extension of that one with some practical considerations.

Accuracy is not always a good measure. When you build an automated machine learning system, you must be able to trust it.

Case Study

  • You want to show positive reviews on your website.
  • Say 90% of the reviews in your dataset are negative.
  • A classifier can achieve 90% accuracy by predicting all of them as negative.
  • But what you are interested in is finding the remaining 10% and displaying them on your website.

Precision = Of the reviews I showed, what fraction is actually positive? (Did I show something negative?)

Recall = Of all the positive reviews, what fraction did I find? (How good am I at finding positive reviews?)

 

Analogy with Optimist and Pessimist

  • The optimist labels every/most reviews as positive
    • Very good recall, but low precision
  • The pessimist labels every/most reviews as negative
    • Bad recall, good precision

 

Trade-off

  • The trade-off is made while scoring, not while training
  • We can assign labels based on probabilities
  • A decision tree gives a probability from the number of positive and negative samples at the leaf node
  • Logistic regression, of course, gives a probability directly
  • We can change the threshold to trade off between precision and recall
  • Positive only when prob is above a threshold close to 1 => Pessimist
  • Positive when prob is above a threshold close to 0 => Optimist
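The effect of the threshold can be seen with a small sketch. The function name and the toy labels and probabilities are made up; returning precision 1.0 when nothing is predicted positive is a convention assumed here.

```python
def precision_recall(y_true, probs, threshold):
    """Label a sample positive when its probability exceeds threshold."""
    predicted = [p > threshold for p in probs]
    tp = sum(1 for y, pred in zip(y_true, predicted) if pred and y == 1)
    fp = sum(1 for y, pred in zip(y_true, predicted) if pred and y == 0)
    fn = sum(1 for y, pred in zip(y_true, predicted) if not pred and y == 1)
    precision = tp / (tp + fp) if tp + fp else 1.0   # convention: nothing shown, nothing wrong
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

y_true = [1, 1, 0, 1, 0, 0]
probs = [0.95, 0.8, 0.7, 0.4, 0.3, 0.1]

pessimist = precision_recall(y_true, probs, threshold=0.9)    # high precision, low recall
balanced = precision_recall(y_true, probs, threshold=0.5)
optimist = precision_recall(y_true, probs, threshold=0.05)    # low precision, full recall
```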

 

A Single Number Is Not Always Useful

  • I am not a great fan of single numbers like F1 score and AUC
  • You cannot always choose a classifier just by AUC, because the ROC curves might intersect
    • An intersection means that each classifier is better over some range of the trade-off
    • But if they don't intersect, we choose the one with the higher AUC
  • From a business perspective, we should be clear about whether we want more precision or more recall
  • Another practical metric they talked about was precision at k
    • Say I want to display 5 reviews on my website
    • What is the precision among the 5 reviews I have chosen?
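Precision at k can be sketched as follows: rank by score, keep the top k, and count how many are truly positive. The function name and the toy data are illustrative.

```python
def precision_at_k(y_true, scores, k):
    """Fraction of true positives among the k highest-scoring items."""
    ranked = sorted(zip(scores, y_true), key=lambda pair: pair[0], reverse=True)
    return sum(y for _, y in ranked[:k]) / k

y_true = [1, 0, 1, 1, 0, 1, 0]
scores = [0.9, 0.85, 0.8, 0.6, 0.5, 0.4, 0.2]
p_at_5 = precision_at_k(y_true, scores, k=5)   # 3 of the top 5 are positive
```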

 

 

Step Size in Descent Methods

Descent Method

[Figure: general descent method]

  • When the descent direction is opposite to the gradient, it is called gradient descent.
  • We also have steepest descent and Newton's method
  • In this post we will focus on line search
  • The term 'line search' is used because the step size t determines where along the line {x + tΔx} the next iterate will be (ray search would be a more appropriate term).

 

Fixed Step Size

  • Here we keep the step size (t) constant for all iterations
  • Our solution may not converge if t is too large
  • It can take a lot of time if t is too small

 

[Figure: faster_iteration]

 

Exact line search

[Figure: exact line search]

  • It is used when the cost of the minimization problem in one variable (s) is low compared to the cost of computing the search direction (Δx)

 

Backtracking Line Search

  • Exact line search has to solve an optimization problem at every iteration, which can be computationally inefficient
  • Backtracking line search is the most widely used in practice.
  • Here we don't find the optimal value of t, only an approximation that is good enough

[Figure: inexact (backtracking) line search]
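A sketch of backtracking line search inside gradient descent, using the usual sufficient-decrease test: start at t = 1 and shrink t by a factor beta until the function has decreased by at least a fraction alpha of what the linear approximation predicts. The function names, the default values of alpha and beta, and the test function are assumptions for illustration.

```python
def backtracking_step(f, grad_f, x, direction, alpha=0.3, beta=0.8):
    """Shrink t from 1.0 until the sufficient-decrease condition holds."""
    t = 1.0
    fx = f(x)
    # Directional derivative along the search direction (negative for a descent direction).
    slope = sum(g * d for g, d in zip(grad_f(x), direction))
    while f([xi + t * di for xi, di in zip(x, direction)]) > fx + alpha * t * slope:
        t *= beta
    return t

def gradient_descent(f, grad_f, x, tol=1e-8, max_iter=1000):
    """Gradient descent where each step size comes from backtracking."""
    for _ in range(max_iter):
        g = grad_f(x)
        if sum(gi * gi for gi in g) ** 0.5 < tol:
            break
        direction = [-gi for gi in g]
        t = backtracking_step(f, grad_f, x, direction)
        x = [xi + t * di for xi, di in zip(x, direction)]
    return x

def f(x):
    return x[0] ** 2 + 10 * x[1] ** 2

def grad_f(x):
    return [2 * x[0], 20 * x[1]]

x_min = gradient_descent(f, grad_f, [5.0, 1.0])
```

Each backtracking iteration costs only one function evaluation, which is why this is cheaper than solving the one-dimensional problem exactly.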

 

Demo of Convergence

 

[Figure: faster_iteration]

 

 


kNN and Kernel Regression

This post summarizes what I learned from a course by the University of Washington hosted on Coursera.

Parametric vs Non-parametric

Parametric models have a predefined complexity, meaning the complexity is fixed regardless of the number of observations. On the other hand, non-parametric models allow the complexity to grow as the number of observations increases.

Infinite Noiseless Data

With infinite noiseless data, a quadratic fit still has some bias whenever the true function is not quadratic. 1-NN (nearest neighbor), however, can achieve zero RMSE (root mean squared error).

Examples of Non-parametric Models

Non-parametric models include kNN (k-nearest neighbors), kernel regression, spline, and trees. These models do not make strong assumptions about the underlying data distribution and can adapt to varying complexities based on the observed data.

1 NN

In nearest neighbor (NN) prediction, we identify the closest data point to the query point, and the response of that nearest point is our prediction.

Voronoi Tessellation

When working with multidimensional data, plotting the nearest neighbor prediction results in a Voronoi tessellation, also known as a Voronoi diagram. This diagram divides the space into regions based on the closest data point for each query point.

Distance Metrics

Distance metrics play a crucial role in nearest neighbor algorithms. Some commonly used distance metrics include:

  • Euclidean distance: Measures the straight-line distance between two points in the feature space.
  • Scaled Euclidean distance: Allows for different weights on different dimensions, which is useful when certain features carry more importance. For example, in predicting house prices, the square footage may be weighted more heavily than the number of floors.
  • Other examples of distance metrics include Mahalanobis distance, rank-based distance, correlation-based distance, cosine similarity, Manhattan distance, and Hamming distance.

1 NN in Practice

The 1 NN algorithm performs well when the data is dense. However, there are limitations to its effectiveness:

  • In non-dense data regions, it struggles with interpolating between observations.
  • It is sensitive to noise, as a single noisy data point can significantly impact the prediction.
  • 1 NN tends to overfit the training data, resulting in poor generalization to new observations.

To address these limitations, the k-nearest neighbors (kNN) algorithm is often employed.

 

k-Nearest Neighbors (kNN)

In the kNN algorithm, we find the k nearest neighbors and average their responses to form the prediction. This approach helps reduce the impact of noise compared to using only the nearest neighbor (1NN).

Challenges:

  1. Boundary issues: When dealing with boundaries, the same points may repeatedly appear as nearest neighbors, resulting in a flat response.
  2. Sparse region issues: In sparse regions, the same points may also be repeatedly chosen as neighbors, leading to potential inaccuracies.
  3. Discontinuity of fit: The fit obtained using kNN may not be smooth since one neighbor can suddenly be excluded from the set of nearest neighbors.

To address these challenges, we can employ weighted kNN, which assigns different weights to each neighbor based on their proximity.

Weighted k-Nearest Neighbors (kNN):

In weighted kNN, less weight is assigned to neighbors that are farther away from the query point. This approach helps mitigate the issue of fit discontinuity in standard kNN.

There are two common weighting schemes used in weighted kNN:

  1. Simple weighting scheme: In this scheme, the weight assigned to each neighbor is inversely proportional to its distance from the query point. The formula for weight calculation is: weight = 1 / distance.
  2. Sophisticated weighting scheme using kernels: Kernels, such as the Gaussian kernel, are employed to assign weights. The Gaussian kernel never reaches zero, ensuring that even distant neighbors have some influence. On the other hand, kernels like the uniform or triangular kernel eventually reach zero, removing the influence of distant points. The parameter λ controls how quickly the kernel decays. A faster decay means that distant points have less influence.

By incorporating weighting schemes in kNN, we can achieve a more nuanced and smoother fit, addressing the issue of fit discontinuity encountered in standard kNN.
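A sketch of weighted kNN for one-dimensional inputs with a Gaussian kernel. The function names are hypothetical, and the kernel form exp(-distance² / λ) is one common parameterization, assumed here.

```python
import math

def gaussian_kernel(distance, lam):
    """Weight that decays smoothly with distance and never reaches zero in exact arithmetic."""
    return math.exp(-(distance ** 2) / lam)

def weighted_knn_predict(query, xs, ys, k, lam=1.0):
    """Kernel-weighted average of the responses of the k nearest neighbors."""
    nearest = sorted(zip(xs, ys), key=lambda pair: abs(pair[0] - query))[:k]
    weights = [gaussian_kernel(abs(x - query), lam) for x, _ in nearest]
    return sum(w * y for w, (_, y) in zip(weights, nearest)) / sum(weights)

xs = [0.0, 1.0, 2.0, 3.0]
ys = [0.0, 10.0, 20.0, 30.0]
prediction = weighted_knn_predict(1.0, xs, ys, k=3)
```

With the simple scheme, `gaussian_kernel` would be replaced by 1 / distance, which needs special handling when the query coincides with a training point.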

Kernel Regression:

Kernel regression is an alternative approach to kNN, where instead of considering only k neighbors, we consider all observations in the dataset. This allows for a more continuous and smoother prediction.

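The kernel can be unbounded, such as the Gaussian kernel, or bounded, such as the uniform or triangular kernel. With a bounded kernel only a subset of neighbors gets a non-zero weight, but this is still not strictly a kNN approach, because the number of such neighbors varies with the query point.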

When performing kernel regression, two decisions need to be made:

  1. Choice of kernel: The choice of kernel has a relatively smaller impact on the prediction. Various kernels can be used, and their selection is typically based on the specific problem at hand.
  2. Choice of bandwidth: The bandwidth plays a more significant role in the prediction. It determines the spread of the kernel before it reaches zero. A small bandwidth can lead to overfitting, capturing noise and local fluctuations, while a large bandwidth can result in an overly smoothed fit, leading to high bias. The bandwidth can be tuned using the kernel’s parameter λ, which can be selected through techniques such as cross-validation.

By considering all observations and utilizing kernels in kernel regression, we can achieve a more flexible and continuous prediction model, providing a trade-off between bias and variance based on the choice of bandwidth.
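Kernel regression over all observations can be sketched as a weighted average, which also makes the effect of the bandwidth easy to see. The names and the kernel parameterizations below are assumptions for illustration, with one-dimensional inputs.

```python
import math

def gaussian(distance, lam):
    return math.exp(-(distance ** 2) / lam)

def boxcar(distance, lam):
    """Bounded (uniform) kernel: weight 1 within distance lam, 0 beyond it."""
    return 1.0 if distance <= lam else 0.0

def kernel_regression(query, xs, ys, lam, kernel=gaussian):
    """Weighted average of all responses, weighted by the kernel."""
    weights = [kernel(abs(x - query), lam) for x in xs]
    total = sum(weights)
    if total == 0.0:
        raise ValueError("no observation receives a non-zero weight")
    return sum(w * y for w, y in zip(weights, ys)) / total

xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [0.0, 0.0, 10.0, 0.0, 0.0]

narrow = kernel_regression(2.0, xs, ys, lam=0.01)    # follows the single spike
wide = kernel_regression(2.0, xs, ys, lam=1e4)       # close to the global mean of 2
local = kernel_regression(2.0, xs, ys, lam=1.0, kernel=boxcar)
```

The narrow bandwidth reproduces the spike (high variance), while the wide one flattens it to roughly the global average (high bias).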

Local Linear Regression

Up until now, we have discussed the use of weighted averages for prediction. However, an alternative approach is to fit a model in the vicinity of the prediction point, where the errors are weighted by a kernel.

In local regression, we can fit either a linear model or a quadratic model near the prediction point. A linear model is particularly useful for addressing boundary effects, as it provides a linear prediction instead of a constant one.

On the other hand, a quadratic fit can handle curvature in the data but may introduce higher variance in the predictions. Consequently, in practice, a linear fit is often recommended due to its favorable balance between bias and variance.

By employing local linear regression, we can adapt the model’s behavior based on the local data characteristics, enabling more accurate predictions near the prediction point.

Global vs Local Fit

In certain situations, we may find that a linear fit is appropriate for some regions of the input space, while a quadratic fit is more suitable for others. However, determining the breakpoints where the underlying structure changes can be challenging.

Non-parametric models come to the rescue in such cases. Instead of assuming a specific functional form, these models allow for more flexibility in capturing the underlying patterns.

One example of a global fit is taking the average as a constant prediction. This approach provides a simple representation of the data but assumes a uniform behavior across all observations.

On the other hand, kernel regression offers a way to achieve a local fit by applying different weights to each observation. Nearer observations receive higher weights, enabling the model to adapt to local variations in the data.

By leveraging non-parametric models like kernel regression, we can capture both global and local characteristics of the data, allowing for more accurate and adaptive predictions based on the proximity of observations.

Limitations of Non-parametric Approaches

While non-parametric approaches like kNN offer flexibility and adaptability, they do have certain limitations.

  1. Dimensionality: In higher dimensions, non-parametric models require an exponentially large number of observations to accurately capture the underlying patterns. As the number of dimensions increases, the data becomes more sparse, making it challenging to find sufficient neighboring points for accurate predictions.
  2. Data Availability: Non-parametric models heavily rely on the availability of a large amount of data. When the dataset is limited or scarce, it becomes difficult to leverage the full potential of non-parametric approaches.
  3. Computational Complexity: Non-parametric models, such as kNN, involve a brute force search to find the nearest neighbors. This search operation has a complexity of O(N) for 1-NN and O(N log K) for k-NN, where N is the number of observations. While these complexities can be manageable for smaller datasets, they can become computationally expensive as the dataset grows larger.

To mitigate some of these limitations, techniques like clustering can be employed to reduce the search space and improve computational efficiency.

In situations where the dataset is limited or high-dimensional, parametric models offer an alternative by making certain assumptions about the underlying structure. These models provide a more compact representation and are often more suitable when data availability or computational complexity is a concern.

Understanding the limitations of non-parametric approaches helps in selecting the most appropriate modeling technique based on the specific characteristics of the dataset and the available resources.

References

Course by University of Washington

https://www.coursera.org/learn/ml-regression

 

 

Generalized Linear Models (GLM)

In standard linear regression we make two assumptions:

  1. P(Y|X) is a normal distribution
  2. The mean is a linear function of the inputs: µ = β*X
  3. Together: P(Y|X) = Ν(µ, σ^2*I)       # σ is the standard deviation and I is the identity matrix

 

In GLM we relax two things:

  1. P(Y|X) can be any distribution from the exponential family
  2. The mean is some function of β*X
    1. µ = f(β*X)
    2. g(µ) = β*X
    3. g = f^(-1)
    4. g is called the link function

 

Examples of link functions:

  1. log link
  2. reciprocal link
  3. logit link

 

The derivation of the log-likelihood follows the same steps as for the normal distribution. However, there is generally no closed-form solution, so the parameters are found numerically, for example by iteratively reweighted least squares or another convex optimization method.

Here is one example from the MIT course mentioned in the references.

[Figure: poisson]

Logistic

  • In Gaussian regression we predict μ for each sample
    • This μ comes from β0, β1, β2, which are the same for every sample
  • For binomial regression we want to predict p for each sample
    • This p comes from β0, β1, β2, which are the same for every sample
  • Option one:
    • p = β0 + β1*x1 + β2*x2
  • Option two:
    • p = sigmoid(β0 + β1*x1 + β2*x2)
    • f(p) = log(p/(1-p)) = β0 + β1*x1 + β2*x2
    • This is the logit link function
  • What are the other options apart from sigmoid?
    • Step function (not differentiable, which is why we use sigmoid)
    • tanh is sometimes used in deep learning
  • What if we go with option one?
    • The binomial distribution requires p to be in (0,1), but a linear function of x is unbounded
  • Example:
    • How many fish survive (alive/dead) given food and water
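The difference between the two options can be checked numerically. In this sketch the coefficients and inputs are made up; it shows that the linear predictor leaves the interval (0,1) while the sigmoid keeps p inside it, and that logit is the inverse of sigmoid.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def logit(p):
    """Link function: maps a probability back to the linear predictor."""
    return math.log(p / (1.0 - p))

beta = (-1.0, 0.5)                                     # hypothetical β0, β1
linear = [beta[0] + beta[1] * x for x in (0, 2, 6)]    # option one: not valid probabilities
squashed = [sigmoid(v) for v in linear]                # option two: always inside (0, 1)
```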

 

Poisson

  • The Poisson distribution models the probability of observing a count k
    • P(k) = exp(-λ) * (λ^k) / k!
    • Parameter λ >= 0
  • Option one:
    • λ = β0 + β1*x1 + β2*x2
  • Option two:
    • λ = exp(β0 + β1*x1 + β2*x2)
    • f(λ) = log(λ) = β0 + β1*x1 + β2*x2
    • This is the log link function
  • What if we go with option one?
    • We want λ > 0, but a linear function can be negative
    • The relationship between inputs and output is multiplicative rather than additive
      • Suppose enough water multiplies the number of germinated seeds by 1.5 and enough fertilizer multiplies it by 1.2. With both, the count is not multiplied by 1.5 + 1.2 = 2.7 but by 1.5 * 1.2 = 1.8. [3]
  • Example:
    • How many seeds will germinate given water and fertilizer

 

Parameter Estimation

  • We can use maximum likelihood estimation to find the parameters β0, β1, β2
  • Deriving the maximum likelihood for binomial:
    • max_lh = Multiply (likelihood of y_actual of each sample under the predicted distribution)
    • max_lh = Multiply ( Binomial(p) )
    • max_lh = Multiply ( p if y=1 else (1-p) )
    • log(max_lh) = Summation ( y*log(p) + (1-y)*log(1-p) )
  • Deriving the maximum likelihood for Poisson:
    • max_lh = Multiply (likelihood of y_actual of each sample under the predicted distribution)
    • max_lh = Multiply ( Poisson(u) )
    • max_lh = Multiply ( exp(-u) * u^y / y! )
    • log(max_lh) = Summation ( -u + y*log(u) - log(y!) )
  • The above two are rough derivations, but they convey the idea
  • For Gaussian, it turns out to be OLS (Ordinary Least Squares), which has a closed-form solution
  • For the others, we solve it via gradient-based methods or Newton's method.
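As an illustration of the numerical route, here is a sketch that maximizes the Poisson log-likelihood with plain gradient ascent under the log link, u = exp(β*x). The function names, step size, iteration count, and toy data are assumptions; a real fit would typically use Newton's method or a library routine.

```python
import math

def poisson_log_likelihood(beta, xs, ys):
    """Sum over samples of -u + y*log(u) - log(y!), with u = exp(beta . x)."""
    total = 0.0
    for x, y in zip(xs, ys):
        eta = sum(b * xi for b, xi in zip(beta, x))    # eta = log(u)
        total += -math.exp(eta) + y * eta - math.lgamma(y + 1)
    return total

def fit_poisson(xs, ys, step=0.01, n_iter=5000):
    """Gradient ascent; the gradient is the sum of (y - u) * x over samples."""
    beta = [0.0] * len(xs[0])
    for _ in range(n_iter):
        grad = [0.0] * len(beta)
        for x, y in zip(xs, ys):
            u = math.exp(sum(b * xi for b, xi in zip(beta, x)))
            for j, xi in enumerate(x):
                grad[j] += (y - u) * xi
        beta = [b + step * g for b, g in zip(beta, grad)]
    return beta

# Each row is (1, x1): an intercept plus one binary feature, e.g. fertilizer yes/no.
xs = [(1, 0), (1, 0), (1, 0), (1, 1), (1, 1), (1, 1)]
ys = [1, 2, 3, 5, 6, 7]
beta = fit_poisson(xs, ys)
```

On this toy data the fitted rates match the group means: exp(β0) is about 2 without the feature and exp(β0 + β1) is about 6 with it, so the feature acts as a multiplier of about 3.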

 

References :

[0] https://ocw.mit.edu/courses/mathematics/18-650-statistics-for-applications-fall-2016/lecture-slides/MIT18_650F16_GLM.pdf

[1] Wonderful MIT lecture : https://www.youtube.com/watch?v=X-ix97pw0xY

[2] https://onlinecourses.science.psu.edu/stat504/node/216/

[3] https://tsmatz.wordpress.com/2017/08/30/glm-regression-logistic-poisson-gaussian-gamma-tutorial-with-r/

 

Exponential Family

Here is the basic concept :

[Figure: simple form of the exponential family]

  • θ are parameters and X is data, both can be multidimensional
  • We want to restrict terms inside exponential to the form θ*X

 

Formal Definition:

[Figure: formal definition of the exponential family]

  • The functions η and T also help when there is a mismatch between the dimensions of θ and X.
  • g(θ) from the basic concept above has been moved inside the exponential as B(θ).
    • It roughly serves as a normalization factor.
  • h(x) serves as a base distribution, and the exponential term transforms this base distribution

 

[Figure: examples of exponential family distributions]

 
