Anomaly Detection Basics

March 21, 2021November 16, 2021Archit Vora Leave a comment

These are notes form Andrew N.G’s machine learning course on coursera.

Anomaly Detection Problem

Assembly line prepared aircraft engine, you want to check if it okay.
- Features can be
  - heat generated
  - vibration intensity
Fraud detection in finance/retail
- Feature would be based on user’s activity
Monitoring CPUs in data center
- Features would be memory used, CPU load, network traffic.
Generally we have less number of positive (anomalous example) compared to negative ones (normal).

Solution Based on density Estimation

We try to fit normal examples in gaussian distribution
For new engine we estimate the probability of it p
If p < epsilon – we flag the new engine as anomalous
We generally train one gaussian per features and multiply like naiver bayes
- That is to say off diagonal elements in multivariate gaussian are zero
- More details in the last section of this post ( multivariate gaussian )

Model Evaluation

Once we have multiple models having different features how to evaluate which one is better ?
We also need to tune epsilon parameter.
We can use standard setup of train-set, test-set and cross validation set
Train set would have normal examples only. It would okay if few anomalous samples slips in
- So training is unsupervised only
- Predict y = 1 if p(x) < epsilon else 0
Bad metric
- classification accuracy (because classes are imbalanced)
Good metric
- TP, FP, TN, FN
- Precision/recall
- F1 score
Cross validation set is used for tuning epsilon

Anomaly Detection vs Supervised Learning

Supervised model like logistic regression would require
- 1) More training examples
- 2) Somewhat balanced classes

Feature Engineering

Since we are fitting gaussian we need to do some transformation if feature distribution does not look like one. Popular transformations are
- log (x)
- log (x + c)
- sqrt (x)
How to introduce new feature
- We need to do this when p(x) is comparable for normal and anomalous sample
- Once you find anomalous sample for which p(x) is not low enough, try looking deep into it.
- Property which is making it anomalous would be a new feature to add
Feature Engineering Recommendation
- Think about features which will be too high or too low in case of anomaly
- x5 and x6 can be a good feature in below image.

Multivariate Gaussian

Shortcoming of individual gaussian is that in case of correlated features it won’t be able to detect the anomaly.
- Green sample in above image will not be detected
To mitigate this we can hand-code ratio based features
Original model is more popular because it scales well with no of features.

Gradient Boosting

September 26, 2020April 26, 2023Archit Vora 1 Comment

Some people use term bagging and boosting, some use parallel and sequential ensembles. I have written a post on AdaBoost previously. So we are familiar with boosting idea.

In AdaBoost we had a formula for generating labels (label/sample weight in loss function to be precise) for next iteration. In gradient boosting we fit gradient of loss function with respect to predictions.

Prof Balaraman notes

Most of this post contains info from [0]. However Prof. Balaraman offers intuitive perspective in one of his lectures. [3]
My handwritten note on that is available at [2]
We can write decision tree as equation where regions and value in the region are parameters.
Two steps of gradient boosting
1. Fit gradient of loss function using regression trees (least square loss)
2. Compute values in region using loss defined for application (ndcg for ranking, log-loss for classifier anything)
Whatever is application loss, we fit regression tree using least square
Parametric gradient descent can be seen as series of summation starting from initial guess.
Related paper [4]

Basis Functions

Basis functions
- If we want to fit non linear curve with simple linear regression, we will create polynomial features
- It is not limited to regression though
Adaptive Basis function
- Instead of we defining above functions, we learn them
- This can be framed as optimisation problem and can be solved by gradient descent/newton’s method for some hypothesis space
How to include Tree’s in base hypothesis space
- Something which we can not differentiate
- Even if we fix parameters to split on output is still not continuous
- So it is difficult

FSAM

FSAM
- Forward Stage-wise Additive Modelling
- AdaBoost is example of this, another example is L2 boosting
- We can say weak function is step direction and coefficient is step size

L2 Boosting also example of FSAM
- We are fitting residual
- To prevent overfitting we can reduce step size
- If v*h(x) is also in H, we can remove v. (closed under rescaling)
- h(x) can be decision stump we can keep going as long as residuals exist (It will certainly overfit after few iterations)

Adaboost
- It works for classification problem
- When the loss function in FSAM is exponential loss of margin, it reduces adaboost
- margin = y*f(x)
  - We had encountered that in SVM

Disadvantages of AdaBoost
- Gives too much penalty for negative margin, look at the curve of exponential above
- Mislabel training data
- Sometimes there is noise in world
- Does not work when there is high baye’s error
- Also it seems AdaBoost formulation is very much tied to classification
  - We probably needs to tweak it to apply it to ranking problem
  - But gradient boosting can be applied to any problem (loss function)
    - Because Tree is alway using L2 loss

Functional Gradient Descent

Gradient is pseudo residuals
- For L2 loss it is actually residuals (difference between y and f)
- Examples where residuals are not just difference between y and f
  - logistic loss
  - ranking using ndcg
We fit this gradient(pseudo-residuals) with weak function
- Regression Tree is most common
- We find closest approximation to gradient step in base hypothesis space (which is functional space)
- Here we minimize square loss (regression problem)
Earlier tree used to be just stumps
- That is num of leaf node = 2
- Now a days xgboost/light gbm takes upto 8/16 leaf nodes as well
Note that while calculating gradient the hypothesis function is not involved anywhere
How do we calculate gradient over function space
- In close form expression it might look complicated
- From numerical computation perspective we need gradient at just one point (of dimension training data size)
- This is just like taking gradient of weights in linear regression
  - Here size of gradient vector would be number of training examples
- In most of the method we take gradient of parameters, here we are taking gradient of predictions
What should be step size
- Line search
- Regularisation parameter (similar to what we use for any gradient based optimisation)
  - This is more common

Reference

[0] https://www.youtube.com/watch?v=fz1H03ZKvLM

[1] https://davidrosenberg.github.io/mlcourse/Archive/2016/Lectures/8b.gradient-boosting.pdf

[2] https://github.com/arcarchit/datastories/blob/master/notes/gradient_boost_iitm.pdf

[3] https://www.youtube.com/watch?v=FJ-MA0EfJ9k

[4] https://statweb.stanford.edu/~jhf/ftp/trebst.pdf

log-linear and log-log regression

May 24, 2020November 21, 2021Archit Vora Leave a comment

In linear regression taking log is popular way to make relationship linear. We can take log of either response or predictor or both. This gives us four classes [0]

Interpretations of β

Linear Model
- the coefficient β gives us directly the change in Y for a one-unit change in X
Linear Log Model
- β is the expected change in Y when X is multiplied by e (natural log)
Log Linear Model
- Each 1-unit increase in X multiplies the expected value of Y by e β
Log Log Model
- multiplying X by e will multiply expected value of Y by e βˆ

I have coded notebook to see the curves for all four at [1].

A note on Normalisation

Suppose you need to normalise a data to bring it between 0 and 1. This will be feed into some linear function (say ranking function without supervised response y). If you data is exponentially distributed instead of dividing by max, you can try log(sample)/log(max). This is like feature transformation by taking log and then normalising it.
- That’s about normalising variable which seems to have exponential distribution
Also when you check correlation with response variable and you see plots like [1] you know which transformation to take.

References :

[0] https://kenbenoit.net/assets/courses/ME104/logmodels2.pdf
[1] https://github.com/arcarchit/datastories/blob/master/notebooks/log_linear_models.ipynb

Clustering Metrics

January 22, 2019May 8, 2020Archit Vora Leave a comment

Here are some metric available for validating clustering, explanation of each one is available on sklearn. [0]

If ground truth labels are available:

Adjusted Rand Index
Mutual Information Based scores
Homogeneity, completeness and V-measure
Fowlkes-Mallows scores

If not available :

Silhouette Coefficient
- Range (-1,1)
- 1 means it is similar to data-points in each cluster
Calinski-Harabaz Index
Davies-Bouldin Index
Contingency Matrix

Calculating SSE

It is a sum of distance between each point and its cluster center.

Silhouette Score

It is calculated for each point and then we take an average of it.

a(i) is average distance of a point to other points in same cluster.

b(i) is minimum of above average in for point in other cluster. It given the distance to nearest cluster.

s(i) close to 1 means data point is appropriately clustered. -1 means it is very bad clustered.

Setting s(i) to 0 when cluster size is one ensures that curve is not monotonically decreasing.

Elbow method and Silhouette Analysis

Notebook is available at https://github.com/arcarchit/datastories/blob/master/Silhouette.ipynb

sil1 sil2

Rand Index

When cluster labels are available we can use this matrix
It basically checks the similarity between two cluster assignments
- Labels can also be seen as one type of cluster assignment
- Score basically tells us how similar to cluster assignments are
This works by taking pair of points
- Out of all pairs how many pairs are agreed in both clusters mechanism
- Agree mean both
  - They are in same cluster in both mechanism
  - They are in different cluster in both mechanism
The Rand index has a value between 0 and 1, with 0 indicating that the two data clustering do not agree on any pair of points and 1 indicating that the data clustering are exactly the same

rand_index

One drawback of Rand index is that it can given non zero value for random assignment of clusters. To mitigate that there is matrix called Adjusted Rand Index. [2]
- It specifically does not work when no of clusters are high

Reference

[0] : https://scikit-learn.org/stable/modules/clustering.html

[1] : https://github.com/anthonyng2/udemy-the-complete-machine-learning-course-with-python

[2] : https://davetang.org/muse/2017/09/21/adjusted-rand-index/

Oversampling and Under-sampling

January 21, 2019October 25, 2020Archit Vora Leave a comment

When data is class-imbalanced there is a tendency to predict majority class. One way to tackle this would be apply more weight to minority classes in cost function. Another way is oversampling and under-smapling.

Over-sampling makes duplicate copies of minority classes
Under sampling randomly removes some samples from majority class
- This should be used with caution
- We need to check once that we still remain with enough sample for a given no of features
Practically we might want to over sample some classes and under-sample others.

Cross validation

Validation set should be taken out from original data[1]
- We can do the sampling just before training only on training data

Reference

[0] : https://www.kaggle.com/residentmario/undersampling-and-oversampling-imbalanced-data

[1] : https://www.marcoaltini.com/blog/dealing-with-imbalanced-data-undersampling-oversampling-and-proper-cross-validation

Generative and Discriminative Models

January 18, 2019May 17, 2023Archit Vora Leave a comment

Introduction

In machine learning, there are two broad categories of models: generative models and discriminative models. In this blog post, we will discuss some popular examples of each and their capabilities.

Generative Models

Naive Bayes Classifier
Hidden Markov Models
Latent Dirichlet Allocation
Boltzmann Machine
Gaussian Mixture Model ( Unsupervised clustering)

Generative Models Explanation

Generative models allow us to generate datasets for specific classes, labels, or clusters. They estimate the joint probability distribution P(X, y). An example of this is the Naive Bayes Classifier, where we have a probability associated with each (X, y) pair. For classification, we predict y based on the highest probability given x.

𝑎𝑟𝑔𝑚𝑎𝑥𝑦𝑃(𝑌=𝑦|𝑋=𝑥)=
𝑎𝑟𝑔𝑚𝑎𝑥𝑦𝑃(𝑌=𝑦,𝑋=𝑥)/𝑃(𝑋=𝑥) = 𝑎𝑟𝑔𝑚𝑎𝑥𝑦𝑃(𝑌=𝑦,𝑋=𝑥)/𝑃(𝑋=𝑥)

Generative models offer more than just prediction. They can be used to:

Impute missing data
Compress datasets
Generate unseen data

Discriminative Models

Logistic Regression
Support Vector Machines
Decision Trees

Discriminative Models Explanation

Discriminative models focus on learning the boundaries that separate different classes directly. They do not model the entire joint probability distribution but instead estimate the conditional probability P(y|x). Examples of discriminative models include Logistic Regression, Support Vector Machines, and Decision Trees.

Hidden Markov Models

December 11, 2018May 17, 2023Archit Vora 1 Comment

From a Clustering Perspective

This section summarizes a lecture from the University of Washington [0] on clustering time series data, considering the significance of both the data and indices.

Other potential applications include:

Honey bee dance: Bees switch from one dance to another to convey messages.
Conference conversations: Segmenting speaker assignments based on the spoken turns.
Gym exercises: Identifying exercises from pulse rate data as people switch between activities.

Model

The following screenshots are from a YouTube video [1] by the Mathematical Monk, illustrating the model:

Suppose you’re developing handwriting recognition and need to recognize a hidden variable.

The prediction for “i” depends solely on the previous character being “h,” disregarding how “h” was written.

Code and Notebook

You can find the code and notebook at the following GitHub link [2], which extensively explains:

The structure of the model.
The forward algorithm for calculating the likelihood of a given observation.
The backward algorithm for finding the most probable state sequence given an observation (also known as decoding).
The forward-backward algorithm for inferring model parameters from a set of observed sequences.

References

[0] : https://www.coursera.org/learn/ml-clustering-and-retrieval/home/welcome

[1] : https://www.youtube.com/watch?v=TPRoLreU9lA

[2] : https://github.com/arcarchit/datastories/blob/master/hmm.ipynb

[3] : https://web.stanford.edu/~jurafsky/slp3/A.pdf

Hierarchical Clustering

December 9, 2018May 8, 2020Archit Vora Leave a comment

This post is a lecture summary of course by University of Washington : https://www.coursera.org/learn/ml-clustering-and-retrieval/home/welcome

Why one might want to use hierarchical clustering ?

A common problem is clustering is to predefined no of clusters (k)
- Hierarchical clustering does not need that
Deprogram provides effective visualization of clusters at various levels without re-running the algorithm
Any distance metric can work, while k-mean required euclidean distance [0][1]
We can construct more complex shape compared to k-means/EM

Divisive clustering

Example algorithm is recursive k-mean
We first cluster with k = 2 and then recursive apply k-means on these two clusters
Various choices we need to make :
- Which algorithm to use (k-means)
- How many split at each level (k=2, 3)
- When to stop? Here are some heuristics:
  - Number of points in a cluster falls below certain threshold (Minimum point in each cluster)
  - Distance to farthest point falls below certain threshold (Minimum cluster radius)
  - Until per-specified no of clusters are reached

Agglomerative clustering

Example would single linkage algorithm
We start with assigning each point to its own cluster
Next task is how to merge this cluster
- How do we define the distance between cluster?
  - Linkage function
    - It gives us two points from each cluster
    - Single linkage given points which has minimum distance across all possible pairs
    - Cluster A has points (a1, a2, a3) and B has (b1, b2, b3)
    - Linkage will give say (a2, b3) if it has the minimum distance across all combination
  - Distance Function
    - Can be euclidean or cosine anything
- We merge the cluster which has minimum distance
Dendrogram provides efficient visualization [4]
- Data-points of x-axis carefully ordered
- Height (y-axis) represents distance between two cluster as per specified linkage
- It gives us the complete visualization of which nodes were merged first followed by which one etc.
How to cut dendrogram ? [4]
- Suppose we are cutting the dendrogram at y=D*
- There are no pair of cluster which has distance < D* and which is not merged
- All the pairs of clusters having distance < D* are merged
- D* is the minimum distance between cluster at this level
- Distance between clusters identified > D*
Various linkage functions [2][3]
- Complete linkage takes two farthest points
- Average linkage considers all pairs and takes an average
- Ward linkage focus on increasing sum of squares
- These linkage however brings down the shaping flexibility
  - Single linage would allow for complex shapes in general
  - Single linkage however can result in chaining
Choice we need to make in agglomerative clustering
- Which linkage function to use
- Which distance metric to use
- Should we cut dendrogram flat or some weird cut

dendrogram

Complexity
- For brute-force it is O(N^2 log N)
- We can also prune search space by kd-trees or using triangular inequality
- Best know is O(N^2) [5]

Linkage Functions

linkage

References

[0]: https://stats.stackexchange.com/questions/81481/why-does-k-means-clustering-algorithm-use-only-euclidean-distance-metric

[1]: https://pdfs.semanticscholar.org/a630/316f9c98839098747007753a9bb6d05f752e.pdf

[2] : https://www.saedsayad.com/clustering_hierarchical.htm

[3] : https://stats.stackexchange.com/questions/195446/choosing-the-right-linkage-method-for-hierarchical-clustering

[4] : http://www.econ.upf.edu/~michael/stanford/maeb7.pdf

[5] : https://www.cs.ucsb.edu/~veronika/MAE/summary_SLINK_Sibson72.pdf

Which ML algorithm to use when?

December 7, 2018December 7, 2018Archit Vora Leave a comment

We generally find this question while starting up new project or we want to compare some algorithms to discriminate them (discrimination helps understand things better sometimes).

Although there is no definitive answer to this, I am writing here summaries from some posts. [0]

machine-learning-cheet-sheet

Factors to consider

Accuracy
- Most of people focus on accuracy but it practically it is not the only
Training time
- Naive Bayes and logistic regression are much faster than boosting or neural nets
Linearity
- LR and SVM are suitable when classes are linearly seperatble
  - Of course SVM bypasses it via kernel trick but still not as much complex decision boundary as nueral nets
- Despite the risk of non linearity in data linear algorithms tends to work well in practice and are often used as starting point
Number of parameters
- Parameters does affect training time and accuracy
- More parameters helps learning complex function, however it requires more data to prevent over-fitting
No of features
- When data point are not enough for no of features (text, NLP) SVM works well

Notes

Try out linear/logistic regression, SVM first when you most dependent variables are numeric.
SVM
- SVM suites more when no of data points are less for given no of features.
- SVM is linear classifier only. It just uses kernel trick to project linearly inseparable data on high dimension.
- SVM is solved by mathematical optimization problem unlike nueral nets. Hence tends to be bit faster.
What is the difference between LR and SVM?
- LR has linear decision boundary while SVM can have non linear decision boundary.
Reinforcement learning
- Analyses and optimized behavior of agent, (via feedback from environment)
- They try to discover different actions to maximize reward
- Trial-error and delayed reward distinguishes reinforcement learning from other ML algorithms

References

[0] : https://blogs.sas.com/content/subconsciousmusings/2017/04/12/machine-learning-algorithm-use/#prettyPhoto

[1] : https://docs.microsoft.com/en-us/azure/machine-learning/studio/algorithm-choice

Naive Bayes Classifier

December 4, 2018May 17, 2023Archit Vora 2 Comments

There are two things[1]
- Probability model
- Classification model

Probability Model

A probability model is an extension of Bayes’ rule. It makes two assumptions:

Independence of Features: This assumption assumes that all features are independent of each other. However, it does not hold true in many cases. For example, having higher temperature does not necessarily imply higher humidity.
Equal Weight of Features: This assumption assumes that all features have equal importance or weight in the model.

Classification Model

The classification model involves the following steps:

Probability of Each Class: P(y) represents the probability of each class based on the training set.
Probability Estimation of Feature Values: The goal is to estimate the probability distribution of each feature value given a specific class, denoted as P(x_i|y). For discrete features, this can be achieved through simple probability calculations, such as multinomial Naive Bayes. For continuous features, Gaussian distributions can be used. In the case of count data, multinomial distributions are suitable.
Parameter Estimation: Parameter estimation is performed for each combination of class and feature.
Scikit-learn and Distribution Types: Scikit-learn library provides implementations of Gaussian Naive Bayes, Bernoulli Naive Bayes, and multinomial Naive Bayes classifiers. These classifiers refer to the distribution of features. It is important to note that different features can follow different distributions. Therefore, customization of the distribution based on the application may be necessary.

Advantages

Fast and Easy Implementation: Naive Bayes classifiers are known for their simplicity and efficiency in implementation.
Acceptable Classification Performance: While Naive Bayes classifiers may not always accurately predict probabilities, their classification performance is generally satisfactory.

Disadvantage

Independence Assumption: The assumption of feature independence does not hold true in all scenarios, which can affect the model’s accuracy.

Reference

[0] https://towardsdatascience.com/naive-bayes-classifier-81d512f50a7c

[1] https://en.wikipedia.org/wiki/Naive_Bayes_classifier