Linear Regression Derivations

  • Suppose you have just two parameters a and b and have n observation
    • y = a + b*x
    • a and b can be found by simple optimization problem [1]
      • By setting partial derivative to 0
      • br2derivation

 

Simpler R2 formula is available at https://datastoriesweb.wordpress.com/2017/01/15/interpreting-statistical-values/

 

Matrix closed form Solution

gradient                           closed_form

 

Gradient Descent

  • We assume some initial values for parameters
  • We calculate gradient for each parameter and move in the negative direction of it by step size
  • descent

 

I have coded gradient descent here at [2]

 

 

References :

[1] http://seismo.berkeley.edu/~kirchner/eps_120/Toolkits/Toolkit_10.pdf

[2] https://github.com/arcarchit/datastories/blob/master/gradient_descent.ipynb

Bayesian Learning – Quick Summary

We had used Bayesian learning for house price prediction project, notebook is available at [0]. Purpose of this blog is to have quick summary of concept involved.

  • Bayesian learning allows us to have distribution for parameters rather than point estimate.
  • We keep sampling values of parameters say 10000 times. Histogram of last 6000 sample represent approximate posterior of parameter.
  • Every parameters, latent variables, output have distribution associated with them.
  • Top parameters in the hierarchy will have prior associated with them.
    • We sample for these top parameters
    • Calculate corresponding value for all latent variable coming down the tree
    • Finally at output calculate likelihood against observed data
  • Different MCMC algorithms differs in two steps:
    • How to jump to next to next sample
    • How to decide whether next sample is acceptable or not
  • Metropolis algorithm:
    • Jumps considering normal distribution at previous parameter value and some fixed standard deviation
    • Acceptance test : random(0,1) < ratio of (Pnew/Pold)
      • When Pnew is higher we will definitely accept
      • Else probability of acceptance depends on how large Pold is
      • If Pold = 2 * Pnew it is 50 %
    • Pnew = liklihood_new * prior_new
    • Pold = liklihood_old * prior_old
  • Example of other MCMC algorithms:
    • Metropolis-Hastings
    • The Gibbs Sampler
    • Hamiltonian MCMC
    • The No-U-Turn Sampler (NUTS)

Reference

[0] : https://github.com/arcarchit/House-Prices-Advanced-Regression-Techniques/blob/master/notebooks/Bayesian%2BMaster.ipynb

[1] : http://twiecki.github.io/blog/2015/11/10/mcmc-sampling/

[2] : https://www.quantstart.com/articles/Bayesian-Inference-of-a-Binomial-Proportion-The-Analytical-Approach

Probabilistic Clustering

This post serves as a lecture summary of a course offered by the University of Washington, which can be accessed at the following link: University of Washington – ML: Clustering and Retrieval.

Sample code related to the course can be found on GitHub at: Sample Code for EM Algorithm.

In this lecture, we will cover several important topics:

  1. Why we need probabilistic clustering: We will explore the motivations behind using probabilistic clustering methods, which allow for soft assignments to clusters rather than rigid assignments.
  2. Mixture models: We will dive into the concept of mixture models, which are a probabilistic approach to clustering. Mixture models allow data points to belong to multiple clusters with different probabilities.
  3. Expectation Maximization (EM): We will study the Expectation Maximization algorithm, a powerful iterative algorithm used to estimate the parameters of mixture models. EM alternates between the expectation step, where cluster assignments are updated based on current parameter estimates, and the maximization step, where the parameters are updated based on current cluster assignments.

Why we need probabilistic clustering ?

To understand the need for probabilistic clustering, let’s consider the problem of determining user preferences based on their feedback. Suppose users provide feedback on whether they liked or disliked an article. We want to gather this feedback and group it into clusters to gain insights into their preferences.

For instance, imagine a user who likes an article about the first human landing on Mars. This article could be relevant to both world news and science news categories. In such cases, assigning articles to clusters with a hard assignment (where an article belongs to only one cluster) may not capture the complete picture. Soft assignment to clusters, where an article can belong to multiple clusters with different probabilities, is more suitable.

Now, let’s explore some scenarios where the traditional K-means algorithm fails:

  1. Clusters with Different Spreads: When clusters have varying spreads, simply deciding based on a single sample may not provide an accurate representation. Instead, it is important to consider the number of data points within each cluster. Assigning an ambiguous point to a cluster that has more data points can be a more meaningful approach.
  2. Overlapping Clusters: In cases where clusters overlap, soft assignment becomes crucial. It allows for assigning data points to multiple clusters based on the degree of membership.
  3. Clusters with Different Shapes or Orientations: When clusters have distinct shapes or orientations, measuring the distance to the cluster center alone is insufficient. Considering the shape of the clusters is necessary for accurate clustering.

One approach to address these challenges is using K-means with scaled Euclidean distance. This modification results in clusters represented by axis-aligned ellipses, accommodating different spreads in each dimension. However, determining appropriate weights for each dimension in the calculation remains a question that needs to be addressed. Shape is still axis-aligned ellipses, which is a limitation in itself.

Motivating application

Let’s consider the task of clustering images based on their color composition. One approach is to compute the average values of the red (R), green (G), and blue (B) channels for each image. This results in representing each image as a single 3-dimensional vector [R, G, B], capturing the average color values.

By applying clustering algorithms to these image vectors, we can group similar images together based on their dominant color characteristics. Although our data is unlabeled, we can observe potential clusters such as:

  1. Sky Cluster (Blue Dominated): Images that predominantly contain blue colors, such as those depicting clear skies, can be grouped into this cluster.
  2. Sunset Cluster (Red Dominated): Images showcasing vibrant sunsets with warm hues and red tones can be grouped into this cluster.
  3. Forest Cluster (Green Dominated): Images featuring lush green forests, landscapes, or vegetation will likely be grouped into this cluster due to their dominant green color composition.

Mixture Models

  • Here is how distribution of blue might look for all images
blue

We can model image clustering using a mixture of Gaussian distributions. The model consists of three Gaussian components, each with its own parameters (μ and σ). The final distribution is a convex combination of these components: π1g1 + π2g2 + π3*g3, where 0 ≤ π ≤ 1 and the sum of all π values is equal to 1.

Each mixture component represents a unique cluster, characterized by its own set of parameters (πk, μk, σk). In the case of multidimensional data, such as image vectors, each individual distribution is a multivariate Gaussian distribution.

Without examining the image vector, we can determine the probability of it belonging to cluster k using the πk value. Given an image vector x, we can estimate the probability of it belonging to cluster k as P(x | params) = P(x | μk, Σk), where Σ represents the covariance matrix in the multivariate Gaussian.

Clustering with soft assignment involves calculating a responsibility vector for each data point. The length of the responsibility vector corresponds to the number of clusters, which in this case is three (sky, forest, sunset). Each element of the responsibility vector represents the probability of a data point belonging to a specific cluster.

In the case of document clustering, where the dimensionality is high (proportional to the number of words in a vocabulary), keeping track of parameters for each Gaussian becomes computationally expensive. To address this, we simplify the model by considering only diagonal values in the covariance matrix, effectively restricting the model to axis-aligned ellipses. This approach is similar to K-means with scaled Euclidean distance, which also allows for axis-aligned ellipses.

However, the advantage of the mixture model is twofold:

  1. We do not need to specify weights for each dimension in advance; the model learns them from the data.
  2. Different clusters can have varying weights for each dimension, allowing for more flexible modeling of the data.

EM

Responsibility vector calculation with known cluster parameters

  • The responsibility vector can be calculated if we know the cluster parameters: π_k, μ_k, and ∑_k.
  • The responsibility vector is different from the π values.
  • Given a data point, we can find its corresponding responsibility vector by estimating the probabilities in that cluster.
  • The responsibility vector needs to be normalized after evaluating the probabilities.
a
  • This process is a consequence of Bayes’ rule.

Estimating cluster parameters with known responsibility vectors

  • It is similar to parameter estimation of a multivariate Gaussian.
  • Each data point, x_i, is multiplied, and there is no need to worry about the denominator since it is also not N (the number of data points).
  • b1         b2      nk_soft
  • N_k_soft implies how many points belong to cluster 1
    • Also it is a soft assignment

EM Algorithm

  • E-step: Estimation of the responsibility vector given parameters.
  • M-step: Maximum likelihood estimation of the parameters given the responsibility vector.
  • These two steps are repeated until convergence.
  • The EM algorithm can be derived as a coordinate ascent algorithm and converges to a local mode.
  • The biased coin example provides a good illustration of the EM algorithm. Check the image below

mj0nb

Challenges

  • The probability becomes infinite when the variance becomes zero.
  • This can occur when there is only one data point in a cluster, and the mean is itself with zero variance.
  • In high-dimensional spaces, this often happens, such as when a document does not have a specific word at all.
  • To solve this issue, a small value is added to the diagonal elements of the covariance matrix, similar to adding a prior in Bayesian approaches.

Relationship to K-means:

  • When a Gaussian has the same variance in all dimensions and shrinks to zero, the likelihood becomes either 0 or 1.
  • This behavior is similar to hard assignment, resembling the K-means algorithm.

Initialisation

  • Choose k observations at random and assign observations to nearest centroid
  • Above via k-means++
  • Solution of k-means
  • Grow mixture model by splitting cluster (sometime merging) until K clusters are formed

I have coded simple EM at [0]

References

[0] : https://github.com/arcarchit/datastories/blob/master/EM.ipynb

K-Means Clustering

This post is a lecture summary of a course by the University of Washington available on Coursera: Machine Learning: Clustering and Retrieval.

Application

We want to discover groups of articles (e.g., sports, world news) without knowing the labels in advance. We can assign labels after finding clusters. Our goal is to learn user preferences, whether she likes sports/technology etc.

Users provide feedback on whether they liked an article or not. We can accumulate this feedback within specific clusters to understand preferences.

Contrast this with building a classification model that predicts whether a user will like a specific article. For that, we would need to build a classifier for each user and score each (article, user) pair.

Alternatively, we can determine the types of articles each user prefers and show nearby articles using K-nearest neighbors (KNN). Clustering offers a more elegant and simplistic approach.

Unsupervised Task

Clustering is an unsupervised task since there are no input-output labels. Unsupervised learning poses challenges:

  • Defining clusters can be easy or hard, depending on the case.
  • Our algorithm works well for cases with intermediate difficulty.
  • There are still unsolved clustering problems.
unsolved_clustering

Clustering in General

A cluster is defined by its center or centroid and its shape, often represented by an ellipsoid.

K-Means

Scoring in K-means involves assigning a point to a given cluster. The scoring mechanism is the distance to the cluster center.

K-means alternates between two steps:

  1. Assign data to clusters.
  2. Update cluster centers.
step1
step2

Both steps are minimization steps, making K-means resemble a coordinate descent algorithm. We alternate between minimizing in two directions: μ (cluster assignment) and z (cluster center).

Convergence

K-means is a non-convex optimization problem, so the global minimum is not guaranteed. We aim to find zᵢ and μᵢ that minimize the objective function.

k_means_objective

By optimizing via coordinate descent, K-means can converge to a local minimum. However, convergence depends on the initialization. To improve convergence, we use smart initialization as described below.

Smart Initialization

  1. Pick the first center randomly.
  2. Calculate the distance to the nearest center for all data points.
  3. Choose the next center with a probability proportional to the squared distance.

To perform probability sampling, calculate the cumulative probability as described in [1]. When K-means is followed by this initialization, it is known as K-means++.

Trade-offs

K-means++ takes more time for initialization but converges faster. Additionally, the converged solution is better than that of regular K-means.

Assessing Quality

Quality can be assessed by evaluating the objective value after convergence, which is referred to as cluster heterogeneity. A smaller heterogeneity indicates a better algorithm. Tighter clusters are more homogeneous.

Beware of overfitting: as K increases, heterogeneity always decreases.

Choosing K

To choose K, you can heuristically plot heterogeneity for various values of K and select the one at the elbow. Alternatively, you can use hierarchical clustering, which doesn’t require specifying K in advance. We can also perform Silhouette Analysis.

References

[1]: Stack Overflow: How exactly does k-means++ work?

Nearest Neighbour Search

 

This post is a lecture summary of a course by the University of Washington available on Coursera: Machine Learning: Clustering and Retrieval.

Problem: Document Retrieval

Given a document, we want to search for similar documents in a corpus.

Representation: How to Represent Documents as Vectors?

  • Bag of Words: Count the frequency of each word in the vocabulary.
  • TF-IDF: Calculate the IDF (inverse document frequency) of each word in the vocabulary. Weight word frequency (TF) by its IDF in a given document.
    • TF: Same as bag of words.
    • IDF: Higher for rare words, calculated as log(D_total / (1 + D_word)).

Distance Metrics

There are different distance metrics used to measure similarity or dissimilarity between vectors. Let’s explore some commonly used distance metrics:

Euclidean Distance:

  • Euclidean distance between two vectors x and y can be calculated using the vectorized form: sqrt((x-y).T.(x-y)).
  • It is beneficial to normalize features because features can have different units and varying ranges.
  • Normalization can be achieved by dividing either by the range or by the variance of the features.

Scaled Euclidean Distance:

  • Scaled Euclidean distance allows for weighting different dimensions differently based on their importance.
  • For example, in text analysis, we may want to give more weight to the title/abstract than the body of an article.
  • This can be achieved by summing up the word vectors of the article and title and using a diagonal matrix A that contains weights for each feature.
  • The vectorized form for scaled Euclidean distance is: sqrt((x-y).T.A.(x-y)).

Cosine Similarity:

  • Cosine similarity measures the cosine of the angle between two vectors.
  • Similarity = x.dot(y).
  • Distance = 1 – similarity.
  • Cosine similarity is not a proper “distance” metric because it does not satisfy the triangular property (|ab| + |bc| > |ac|).
  • Cosine similarity is commonly used in Natural Language Processing (NLP) tasks due to its computational efficiency for sparse features.
  • Feature vectors in NLP tend to be sparse.
  • The range of similarity values is generally between -1 and 1. However, when features are non-negative, the range becomes 0 to 1, which is the case with TF-IDF.
  • Normalization can be performed instead of using the inner product: Similarity = x.dot(y) / (|x||y|).
  • Normalization avoids doubling the length of documents by copying.
  • One side effect of normalization is that long documents and tweets may appear similar, which is undesired when suggesting tweets to someone reading a document.
  • To address this, an alternative approach is to cap the maximum word count that can appear in a vector.

Cosine vs Euclidean:

  • The choice between cosine and Euclidean distance depends on the use case.
  • Cosine distance is more suitable for text features where the magnitude of the vector is not as important as the orientation.
  • Euclidean distance is a better choice for count features, such as measuring how many times a document was read.
  • For mixed feature types, both cosine and Euclidean distances can be computed (possibly with a subset of relevant features) and combined using appropriate weights, considering the trade-offs and requirements of the specific problem.

 

Brute Force

  • Calculate distance to each document in corpus
  • Suppose there are N documents in corpus
  • Complexity  would be O(N) for 1-NN
  • It will be O(N log k) for k-NN
    • Using priority queue
  • It is computationally inefficient when we N is large and we need to query often

KD-Trees

KD-Trees are used to divide the space into hierarchical regions via a tree structure. Here’s how it works:

  • At each node, we store the smallest bounding box that contains all the data points in that region.
  • While querying (for 1 nearest neighbor, which can be extended for k nearest neighbors):
    • First, we find the leaf node where the query point belongs.
    • Then, we find the nearest neighbor within that node.
    • We keep moving up in the tree.
    • If the distance between the query point and the bounding box is greater than the distance found so far, we skip that region and move up the tree.
    • If it is less, we traverse down the tree.
  • To calculate the distance between a point and a rectangle, you can use appropriate distance metrics like Euclidean distance or other suitable measures. It is computationally easy rectangles are axis aligned[1]

Construction of KD-Trees:

Here are the steps involved in constructing KD-Trees:

  • Choose a splitting strategy:
    • Widest dimension: Split based on the dimension with the widest range.
    • Alternating dimension: Split alternately between dimensions.
  • Determine the value to split on (split value):
    • Median: Choose the median value of the points along the splitting dimension.
    • Center point: Compute the center point as the average of the left and right extremes.
  • Stop conditions for splitting:
    • If there are fewer points left than a specified threshold.
    • If the minimum width of the region is achieved.

How to Prune More Aggressively:

To prune more aggressively in KD-Trees, consider the following approach:

  • Suppose the nearest distance found is r.
  • Instead of pruning points that are farther than distance r, prune all points that are farther than (r/a), where a > 1.
  • This allows for more aggressive pruning and can speed up the search process.

Limitations of KD-Trees:

KD-Trees face challenges in high-dimensional spaces. Here are some limitations:

  • In high dimensions, the radius of the search space is generally large.
  • For d dimensions, the radius intersects 2^d hypercubes, resulting in an extensive search space.
  • One possible solution is to prune all points that are farther away than (r/a), where a > 1, to mitigate these issues.

 

LSH – Locality Sensitive Hashing

Locality Sensitive Hashing (LSH) is a technique used to classify data points into various bins and search within a given bin. Here’s how it works:

  • The idea is to group data points into bins based on their proximity and search within the relevant bin(s).
  • If necessary, nearby bins can also be explored to expand the search.
  • To determine nearby bins, multiple random lines (or planes in higher dimensions) are used.
  • Let’s consider a motivating example with points in a 2D plane:
    • A single random line passing through the origin is chosen.
    • Points that are closer to each other will mostly fall into the same bin.
  • However, using just one line can lead to more data points in a single bin, increasing query time.
  • The solution is to use multiple random lines (or planes) to achieve more efficient classification of neighboring bins.
  • For example, let’s consider three lines, f1, f2, and f3, defined as f(x,y) = 3x + 4y = 0:
    • If f1(p,q) < 0, f2(p,q) > 0, and f3(p,q) > 0 for a given point (p,q), we assign it to the bucket [0,1,1].
  • To facilitate searching within bins, a dictionary can be used, where the key represents the bin (e.g., [0,1,1], [0,1,0]), and the value is a list of all points in that bin.
  • To find nearby bins:
    • Flip one bit to find bins with a single bit difference.
    • For more exploration, flip two bits to find bins with two bits difference.
  • It is possible for the nearest point to be in a different bin.
  • The number of neighboring bins to explore depends on factors such as computational budget or a predefined quality threshold.
  • In some cases, an exact nearest neighbor may not be necessary, and a point with a predefined quality (epsilon) is sufficient.
  • LSH is useful in scenarios like document suggestion, where similar documents need to be retrieved efficiently.
  • LSH can also be extended to high-dimensional spaces by using planes or hyperplanes instead of lines.
  • For assigning bins to data points, dot product calculations are performed, which work well even in large dimensions, especially when the feature vectors are sparse.

References

[1] https://stackoverflow.com/questions/5254838/calculating-distance-between-a-point-and-a-rectangular-box-nearest-point

[2] https://gamedev.stackexchange.com/questions/44483/how-do-i-calculate-distance-between-a-point-and-an-axis-aligned-rectangle

Overfitting in Decision Trees

We had previously written about classification trees and regression trees.

In this blog we will mainly write about preventing over fitting in decision trees.

Handling missing values for decision tree is described here.

Over-fitting

  • For decision trees as depth increases training error always go towards zero.
  • Occam’s razor philosophy
    • Suppose there exists two explanation for an occurrence
    • Then less complex explanation is generally correct
    • You have headache and stomachache.
    • Disease D1 explains headache, D2 explains stomachache, while D3 explains both
    • Instead of diagnosing patient with both D1 and D2, saying it D3 would be more accurate in general
  • So when you have two trees with same validation error, choose one with lesser complexity
  • There are two approaches:
    • Early Stopping
    • Pruning

Early Stopping

There can be three stopping condition

  1. Pre define max_depth
    1. It can be tuned via cross validation, which increases training time
    2. Does not work well when you have less data
  2. Stop if decrease in classification error is less than some threshold
    1. It can become tricky as in case of XOR
    2. XOR is very famous case in classification literature 
  3. Stop if very few point are left in the node
    1. This works pretty well in practice
    2. Stop if node has say 10 to 100 sample

Pruning

  • Pruning generally is better solution than early stopping
  • Here we build the entire tree first and than remove certain node
  • Criterion for removing nodes:
    • cost(tree) = error(tree) + λ*L(tree)
    • L(tree) = no of leaf nodes
  • if cost(deep_tree) > cost(small_tree) then don’t split that node
  • Algorithm:
    • Start from bottom up the tree and traverse up and test one node each time
    • Calculate cost
    • Make decision to prune it or not

References:

Course by University of Washington

https://www.coursera.org/learn/ml-classification

Handling missing values in Decision Tree

We had previously written about classification trees and regression trees.

In this blog we will mainly write about handling missing values specifically for decision trees.

Handling over-fitting for decision tree is described here.

Missing Values

Missing value can cause issues at :

  1. Training time
  2. Prediction time
    1. What if while prediction sample does not have value for some feature

Purification by Skipping

  • This approach involves skipping rows or entire features that contain missing values.
  • However, this method comes with a drawback as it results in the loss of valuable information.
  • It is important to note that skipping missing values during training will not resolve the issue during prediction time

Imputation

  • Imputation refers to the process of filling in missing values with estimated or imputed values.
  • It provides a means to address missing values during both training and prediction.
  • A dictionary can be created to store imputed values for each feature, which can then be utilized during prediction.
  • For categorical features, a common imputation method is to use the mode, while for numeric features, the average or median is often employed.
  • Advanced techniques, such as the expectation maximization algorithm, can also be utilized for imputation.

However, it’s essential to be aware that imputation can introduce systematic biases into the data. For example, let’s consider a scenario where people from Washington tend to not provide their age, while the rest of the United States does. This could lead to a systematic bias in the imputed values, as the imputation process may assign ages based on data from other regions.

Algo specific technique

In decision tree algorithms, there are specific techniques to handle missing values. At each node in the decision tree, we need to determine which branch the missing value will take. It’s possible for the same feature to have different values at different nodes. To address this, we can make a tweak in the decision tree algorithm:

  1. During feature selection, consider assigning the missing value to each possible branch.
  2. Evaluate the effect of assigning the missing value to each branch by measuring the reduction in classification error.
  3. Choose the feature and branch that result in the highest reduction in classification error when the missing value is assigned.

However, what if we encounter a node during prediction where we didn’t have any missing values during training? In such cases, we can store the direction with the most number of samples at that node as the default direction for missing values. Let’s call this direction “max_sample.”

If we encounter missing values while training the decision tree, we can override the default “max_sample” direction with the observed direction for the missing values.

By incorporating these techniques, we can effectively handle missing values in decision trees and make informed decisions based on the available data.

References:

Course by University of Washington

https://www.coursera.org/learn/ml-classification

On Classification Accuracy – 2

We have already talked about it in this post. Just want to add few more things after finishing a course. This post is just an extension of above with some practical considerations.

We are claiming that accuracy may not be a good measure always. When you are building automated machine learning you must trust it.

Case Study

  • You want to show positive reviews on your website.
  • Say in your dataset 90% reviews are negative.
  • A classifier can achieve 90% accuracy by predicting all of them as negative.
  • But what you are interested in is finding out remaining 10% and display it on your website.

 

Precision = Did I show something negative?

Recall = How good I am at finding positive reviews?

 

Analogy with Optimist and Pessimist

  • Optimist assigns every/most review as positive
    • Very good recall, but less precision
  • Pessimist assigns every/most review with negative
    • Bad recall, good precision

 

Trade-off

  • Trade-off comes while scoring, not while training
  • We can assign labels based on probabilities
  • Decision tree gives probability by no of positive and negative samples at leaf node
  • Logistic regression of-course gives probability
  • We can change threshold to trade off between precision and recall
  • Positive when prob > 1 => Pessimist
  • Positive when prob > 0 => Optimist

 

Single no not always useful

  • Single numbers like F1 score and AUC are something I am not great fan of
  • You can not always choose classifier just by AUC, ROC curve might intersesct
    • This intersection means that one classifier is better at some range of precision
    • But if they don’t intersect we choose the one with higher AUC
  • From business perspective we are should be clear whether we want more precision or recall
  • Another practical metric they talked about was precision at k
    • Say I want to display 5 reviews on my website
    • What is the precision after 5 values I have chosen

 

 

knn and kernel regression

This is to summarize learning from course by University of Washington hosted on Coursera.

 

Parametric vs Non parametric

Parametric models have a predefined complexity, meaning the complexity is fixed regardless of the number of observations. On the other hand, non-parametric models allow the complexity to grow as the number of observations increases.

Infinite Noiseless Data

When dealing with infinite noiseless data, it is important to note that quadratic fit introduces some bias. However, 1-NN (nearest neighbor) can achieve zero RMSE (root mean squared error).

Examples of Non-parametric Models

Non-parametric models include kNN (k-nearest neighbors), kernel regression, spline, and trees. These models do not make strong assumptions about the underlying data distribution and can adapt to varying complexities based on the observed data.

1 NN

n the nearest neighbor (NN) prediction, we identify the closest data point to the query point, and the response of that nearest point is considered as our prediction.

Voronoi Tesselation

When working with multidimensional data, plotting the nearest neighbor prediction results in a Voronoi tesselation, also known as a Voronoi diagram. This diagram divides the space into regions based on the closest data point for each query point.

Distance Metrics

Distance metrics play a crucial role in nearest neighbor algorithms. Some commonly used distance metrics include:

  • Euclidean distance: Measures the straight-line distance between two points in the feature space.
  • Scaled Euclidean distance: Allows for different weights on different dimensions, which is useful when certain features carry more importance. For example, in predicting house prices, the square footage may be weighted more heavily than the number of floors.
  • Other examples of distance metrics include Mahalanobis distance, rank-based distance, correlation-based distance, cosine similarity, Manhattan distance, and Hamming distance.

1 NN in Practice

The 1 NN algorithm performs well when the data is dense. However, there are limitations to its effectiveness:

  • In non-dense data regions, it struggles with interpolating between observations.
  • It is sensitive to noise, as a single noisy data point can significantly impact the prediction.
  • 1 NN tends to overfit the training data, resulting in poor generalization to new observations.

To address these limitations, the k-nearest neighbors (kNN) algorithm is often employed.

 

k-Nearest Neighbors (kNN)

In the kNN algorithm, we consider the k nearest neighbors and use their responses as predictions. This approach helps reduce the impact of noise compared to using only the nearest neighbor (1NN).

Challenges:

  1. Boundary issues: When dealing with boundaries, the same points may repeatedly appear as nearest neighbors, resulting in a flat response.
  2. Sparse region issues: In sparse regions, the same points may also be repeatedly chosen as neighbors, leading to potential inaccuracies.
  3. Discontinuity of fit: The fit obtained using kNN may not be smooth since one neighbor can suddenly be excluded from the set of nearest neighbors.

To address these challenges, we can employ weighted kNN, which assigns different weights to each neighbor based on their proximity.

Weighted k-Nearest Neighbors (kNN):

In weighted kNN, less weight is assigned to neighbors that are farther away from the query point. This approach helps mitigate the issue of fit discontinuity in standard kNN.

There are two common weighing schemes used in weighted kNN:

  1. Simple weighing scheme: In this scheme, the weight assigned to each neighbor is inversely proportional to its distance from the query point. The formula for weight calculation is: weight = 1 / distance.
  2. Sophisticated weighing scheme using kernels: Kernels, such as the Gaussian kernel, are employed to assign weights. The Gaussian kernel never reaches zero, ensuring that even distant neighbors have some influence. On the other hand, kernels like the Uniform or triangular kernel eventually reach zero, diminishing the influence of distant points. The parameter λ is used to control how quickly the kernel reaches zero. A faster decay indicates that distant points will have less influence.

By incorporating weighting schemes in kNN, we can achieve a more nuanced and smoother fit, addressing the issue of fit discontinuity encountered in standard kNN.

Kernel Regression:

Kernel regression is an alternative approach to kNN, where instead of considering only k neighbors, we consider all observations in the dataset. This allows for a more continuous and smoother prediction.

The choice of kernel in kernel regression can be either bounded, such as the uniform or triangular kernel. In such cases, we consider a subset of neighbors, but it is not strictly a kNN approach.

When performing kernel regression, two decisions need to be made:

  1. Choice of kernel: The choice of kernel has a relatively smaller impact on the prediction. Various kernels can be used, and their selection is typically based on the specific problem at hand.
  2. Choice of bandwidth: The bandwidth plays a more significant role in the prediction. It determines the spread of the kernel before it reaches zero. A small bandwidth can lead to overfitting, capturing noise and local fluctuations, while a large bandwidth can result in an overly smoothed fit, leading to high bias. The bandwidth can be tuned using the kernel’s parameter λ, which can be selected through techniques such as cross-validation.

By considering all observations and utilizing kernels in kernel regression, we can achieve a more flexible and continuous prediction model, providing a trade-off between bias and variance based on the choice of bandwidth.

Local Linear Regression

Up until now, we have discussed the use of weighted averages for prediction. However, an alternative approach is to fit a model in the vicinity of the prediction point, where the errors are weighted by a kernel.

In local linear regression, we can fit either a linear model or a quadratic model near the prediction point. A linear model is particularly useful for addressing boundary effects, as it provides a linear prediction instead of a constant one.

On the other hand, a quadratic fit can handle curvature in the data but may introduce higher variance in the predictions. Consequently, in practice, a linear fit is often recommended due to its favorable balance between bias and variance.

By employing local linear regression, we can adapt the model’s behavior based on the local data characteristics, enabling more accurate predictions near the prediction point.

Global vs Local Fit

In certain situations, we may find that a linear fit is appropriate for some regions of the input space, while a quadratic fit is more suitable for others. However, determining the breakpoints where the underlying structure changes can be challenging.

Non-parametric models come to the rescue in such cases. Instead of assuming a specific functional form, these models allow for more flexibility in capturing the underlying patterns.

One example of a global fit is taking the average as a constant prediction. This approach provides a simple representation of the data but assumes a uniform behavior across all observations.

On the other hand, kernel regression offers a way to achieve a local fit by applying different weights to each observation. Nearer observations receive higher weights, enabling the model to adapt to local variations in the data.

By leveraging non-parametric models like kernel regression, we can capture both global and local characteristics of the data, allowing for more accurate and adaptive predictions based on the proximity of observations.

Limitations of Non-parametric Approaches

While non-parametric approaches like kNN offer flexibility and adaptability, they do have certain limitations.

  1. Dimensionality: In higher dimensions, non-parametric models require an exponentially large number of observations to accurately capture the underlying patterns. As the number of dimensions increases, the data becomes more sparse, making it challenging to find sufficient neighboring points for accurate predictions.
  2. Data Availability: Non-parametric models heavily rely on the availability of a large amount of data. When the dataset is limited or scarce, it becomes difficult to leverage the full potential of non-parametric approaches.
  3. Computational Complexity: Non-parametric models, such as kNN, involve a brute force search to find the nearest neighbors. This search operation has a complexity of O(N) for 1-NN and O(N log K) for k-NN, where N is the number of observations. While these complexities can be manageable for smaller datasets, they can become computationally expensive as the dataset grows larger.

To mitigate some of these limitations, techniques like clustering can be employed to reduce the search space and improve computational efficiency.

In situations where the dataset is limited or high-dimensional, parametric models offer an alternative by making certain assumptions about the underlying structure. These models provide a more compact representation and are often more suitable when data availability or computational complexity is a concern.

Understanding the limitations of non-parametric approaches helps in selecting the most appropriate modeling technique based on the specific characteristics of the dataset and the available resources.

References

Course by University of Washington

https://www.coursera.org/learn/ml-regression

 

 

Interpretation of Multiple Regression Coefficients

  • In case of simple linear regression with one variable we interpret slop coefficient as follows
    • y = b0*x0 + c
    • b0 is increase in y for unit increase in x0
  • In case of multiple regression:
    • y = b0*x0 + b1*x1 + c
    • b0 is increase in y for unit increase in x0 keeping x1 constant

 

Implication of keeping other variable constant:

  • Consider
    • house_price = b0 * no_of_bedrooms + c       ….. (1)
    • house_price = b0 * no_of_bedrooms + b1*square_feet + c     …..(2)
  • In (1) there are high chances that b0 will be positive
  • In (2) it can be negative
    • If you increase no of bedrooms keeping square feet constant each room will be smaller
    • This may decrease house price

 

 

Reference :

Coursera course on regression by University of Washington DC : https://www.coursera.org/specializations/machine-learning