Anomaly Detection Basics

These are notes from Andrew Ng’s machine learning course on Coursera.

Anomaly Detection Problem

  • An aircraft engine comes off the assembly line and you want to check whether it is okay.
    • Features can be
      • heat generated
      • vibration intensity
  • Fraud detection in finance/retail
    • Features would be based on the user’s activity
  • Monitoring CPUs in data center
    • Features would be memory used, CPU load, network traffic.
  • Generally we have far fewer positive (anomalous) examples than negative (normal) ones.

Solution Based on Density Estimation

  • We fit a Gaussian distribution to the normal examples
  • For a new engine x we estimate its probability density p(x) under this model
  • If p(x) < epsilon, we flag the new engine as anomalous
  • We generally fit one Gaussian per feature and multiply the densities, as in naive Bayes
    • That is to say, the off-diagonal elements of the covariance matrix in the multivariate Gaussian are zero
    • More details are in the last section of this post (multivariate Gaussian)
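The steps above can be sketched in a few lines. This is a minimal illustration using only the Python standard library, not the course's reference code; the engine readings, the helper names (`fit_gaussians`, `density`, `is_anomalous`) and the value of `epsilon` are all assumptions made for the example.

```python
import math
from statistics import fmean, pvariance

def fit_gaussians(rows):
    """Estimate a (mean, variance) pair for every feature column."""
    columns = list(zip(*rows))
    return [(fmean(col), pvariance(col)) for col in columns]

def density(x, params):
    """p(x) as a product of independent univariate Gaussian densities."""
    p = 1.0
    for value, (mu, var) in zip(x, params):
        p *= math.exp(-((value - mu) ** 2) / (2 * var)) / math.sqrt(2 * math.pi * var)
    return p

# Hypothetical readings from normal engines: (heat generated, vibration intensity).
normal_engines = [(50.1, 1.00), (49.8, 1.10), (50.3, 0.90), (49.9, 1.05), (50.0, 0.95)]
params = fit_gaussians(normal_engines)

epsilon = 1e-3  # assumed threshold; in practice it is tuned on a cross-validation set

def is_anomalous(x):
    """Flag x when its estimated density falls below epsilon."""
    return density(x, params) < epsilon

print(is_anomalous((50.0, 1.0)))   # a reading close to the training data
print(is_anomalous((51.0, 1.3)))   # a reading far from it on both features
```

`pvariance` divides by m rather than m - 1, which matches the maximum-likelihood estimate; with realistic sample sizes the difference is negligible.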

Model Evaluation

  • Once we have multiple models with different features, how do we evaluate which one is better?
  • We also need to tune the epsilon parameter.
  • We can use the standard setup of a training set, a cross-validation set and a test set
  • The training set would have normal examples only. It is okay if a few anomalous samples slip in
    • So training itself is unsupervised; labels are only needed in the cross-validation and test sets
    • Predict y = 1 if p(x) < epsilon else 0
  • Bad metric
    • classification accuracy (because classes are imbalanced)
  • Good metric
    • TP, FP, TN, FN
    • Precision/recall
    • F1 score
  • The cross-validation set is used for tuning epsilon
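Tuning epsilon on the cross-validation set can be sketched as a simple search that keeps the threshold with the best F1 score. The densities, labels and candidate thresholds below are made up for illustration, and `select_epsilon` is a hypothetical helper name.

```python
def f1_score(y_true, y_pred):
    """F1 for the positive (anomalous) class, where 1 = anomaly."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

def select_epsilon(p_cv, y_cv, candidates):
    """Return the candidate threshold with the highest F1 on the CV set."""
    best_eps, best_f1 = None, -1.0
    for eps in candidates:
        y_pred = [1 if p < eps else 0 for p in p_cv]
        score = f1_score(y_cv, y_pred)
        if score > best_f1:
            best_eps, best_f1 = eps, score
    return best_eps, best_f1

# Hypothetical densities p(x) on a cross-validation set and their labels.
p_cv = [0.30, 0.25, 0.002, 0.40, 0.0005, 0.20, 0.03, 0.35]
y_cv = [0, 0, 1, 0, 1, 0, 0, 0]
candidates = [1e-4, 1e-3, 1e-2, 5e-2, 1e-1]

best_eps, best_f1 = select_epsilon(p_cv, y_cv, candidates)
print(best_eps, best_f1)
```

A common refinement is to generate the candidates as evenly spaced steps between the smallest and largest density seen on the cross-validation set.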

Anomaly Detection vs Supervised Learning

  • A supervised model such as logistic regression would require
    • 1) More training examples
    • 2) Somewhat balanced classes

Feature Engineering

  • Since we are fitting a Gaussian, we need to transform a feature if its distribution does not look like one. Popular transformations are
    • log (x)
    • log (x + c)
    • sqrt (x)
  • How to introduce a new feature
    • We need to do this when p(x) is comparable for normal and anomalous samples
    • Once you find an anomalous sample for which p(x) is not low enough, look closely at it.
    • The property that makes it anomalous is a candidate for a new feature
  • Feature engineering recommendation
    • Think about features that will be unusually high or unusually low in case of an anomaly
    • x5 and x6 can be good features in the image below.
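To see what a transformation such as log(x + c) does to a skewed feature, one option is to compare the skewness before and after. The sketch below uses only the standard library; the `traffic` values and the offset `c = 1` are assumptions chosen for illustration.

```python
import math
from statistics import fmean

def skewness(values):
    """Sample skewness: roughly 0 for a symmetric, Gaussian-like feature."""
    mu = fmean(values)
    m2 = fmean([(v - mu) ** 2 for v in values])
    m3 = fmean([(v - mu) ** 3 for v in values])
    return m3 / m2 ** 1.5

# Hypothetical right-skewed feature, e.g. network traffic per machine.
traffic = [1, 2, 2, 3, 4, 5, 8, 13, 40, 120]
c = 1  # assumed offset for log(x + c); usually picked by looking at the histogram
log_traffic = [math.log(x + c) for x in traffic]

raw_skew = skewness(traffic)
log_skew = skewness(log_traffic)
print(raw_skew, log_skew)
```

The transformed feature is still not perfectly symmetric, but it is much closer to the bell shape the model assumes. In practice the usual check is simply to plot a histogram before and after.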

Multivariate Gaussian

  • A shortcoming of independent per-feature Gaussians is that with correlated features they may fail to detect an anomaly.
    • The green sample in the image above will not be detected
  • To mitigate this we can hand-code ratio-based features
  • The original model is more popular because it scales well with the number of features.
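The effect of correlation can be illustrated with a small two-feature example. This is a sketch with made-up data, written with the closed-form 2x2 inverse so it needs only the standard library; `mean_and_cov` and `bivariate_pdf` are hypothetical helper names. It compares how strongly each model penalises a point whose features are individually ordinary but jointly unusual.

```python
import math

def mean_and_cov(points):
    """Maximum-likelihood mean and 2x2 covariance (sxx, sxy, syy) of 2-D points."""
    m = len(points)
    mx = sum(x for x, _ in points) / m
    my = sum(y for _, y in points) / m
    sxx = sum((x - mx) ** 2 for x, _ in points) / m
    syy = sum((y - my) ** 2 for _, y in points) / m
    sxy = sum((x - mx) * (y - my) for x, y in points) / m
    return (mx, my), (sxx, sxy, syy)

def bivariate_pdf(point, mean, cov):
    """Density of a 2-D Gaussian; cov = (sxx, sxy, syy)."""
    dx, dy = point[0] - mean[0], point[1] - mean[1]
    sxx, sxy, syy = cov
    det = sxx * syy - sxy ** 2
    # Quadratic form (x - mu)^T Sigma^{-1} (x - mu) via the closed-form 2x2 inverse.
    q = (syy * dx * dx - 2 * sxy * dx * dy + sxx * dy * dy) / det
    return math.exp(-0.5 * q) / (2 * math.pi * math.sqrt(det))

# Hypothetical, strongly correlated readings: (CPU load, memory used).
normal = [(1, 1.1), (2, 1.9), (3, 3.2), (4, 3.9), (5, 5.1), (6, 5.8), (7, 7.2)]
mean, cov = mean_and_cov(normal)

suspect = (2.0, 6.0)  # each value is ordinary on its own, the combination is not
typical = (4.0, 4.1)

# Density of the suspect point relative to a typical point under the full model.
full_ratio = bivariate_pdf(suspect, mean, cov) / bivariate_pdf(typical, mean, cov)

# Zeroing the off-diagonal term gives the per-feature (independent) model.
diag = (cov[0], 0.0, cov[2])
indep_ratio = bivariate_pdf(suspect, mean, diag) / bivariate_pdf(typical, mean, diag)

print(indep_ratio, full_ratio)
```

Under the independent model the suspect point is only mildly less likely than a typical one, so a reasonable epsilon would not flag it; the full covariance model assigns it a vanishingly small density.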

On Multivariate Gaussian

Formulas

Formula for multivariate gaussian distribution

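$$p(x; \mu, \Sigma) = \frac{1}{(2\pi)^{n/2} |\Sigma|^{1/2}} \exp\left( -\frac{1}{2} (x - \mu)^T \Sigma^{-1} (x - \mu) \right)$$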

Formula of univariate gaussian distribution

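$$p(x; \mu, \sigma^2) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left( -\frac{(x - \mu)^2}{2\sigma^2} \right)$$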

Notes:

  • There is a normalization constant in both equations
  • Σ being positive definite ensures that the quadratic form in the exponent is a bowl opening downwards
  • σ² being positive likewise ensures that the parabola in the exponent opens downwards

On Covariance Matrix

Definition of covariance between two random variables:

$$\mathrm{Cov}(X, Y) = E\left[ (X - E[X]) (Y - E[Y]) \right]$$

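When we have more than two variables we arrange the pairwise covariances in matrix form. So the covariance matrix will look like

$$\Sigma = E\left[ (X - \mu)(X - \mu)^T \right], \qquad \Sigma_{ij} = \mathrm{Cov}(X_i, X_j)$$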

  • The above is very similar to how we compute the variance in 1-D: σ² = E[(x − μ)²]
  • The multivariate Gaussian density requires Σ to be non-singular (invertible), because it uses Σ⁻¹. Every covariance matrix is symmetric positive semidefinite, so this in turn means Σ must be symmetric positive definite.
  • For some data this requirement may not be met, for example when one feature is a linear combination of others

Side Note

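  • Covariance is a directional measure: its sign tells whether two variables move together, but its size depends on their units
  • Correlation is a scaled measure
    • We normalise the covariance by the standard deviations of the two variables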
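A short numeric check makes the difference concrete: changing the unit of one variable rescales the covariance but leaves the correlation unchanged. The measurements below are invented for the example, and the two helper functions are plain standard-library sketches.

```python
import math

def covariance(xs, ys):
    """Population covariance of two equally long sequences."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / len(xs)

def correlation(xs, ys):
    """Covariance divided by the product of the two standard deviations."""
    return covariance(xs, ys) / math.sqrt(covariance(xs, xs) * covariance(ys, ys))

# Hypothetical measurements; the same heights expressed in two units.
height_cm = [160, 165, 170, 175, 180]
height_m = [h / 100 for h in height_cm]
weight_kg = [55, 62, 66, 74, 80]

cov_cm = covariance(height_cm, weight_kg)
cov_m = covariance(height_m, weight_kg)
corr_cm = correlation(height_cm, weight_kg)
corr_m = correlation(height_m, weight_kg)

print(cov_cm, cov_m)    # differ by a factor of 100
print(corr_cm, corr_m)  # agree up to rounding
```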

Derivations

The following derivations are available at [0]:

  • We can prove [0] that when the covariance matrix is diagonal (i.e. the variables are independent), the multivariate Gaussian density is simply the product of the univariate Gaussian densities of each variable.
  • It was derived that the isocontours (figure 1) are ellipses whose axis lengths are proportional to the standard deviations of the individual variables
  • The isocontours remain elliptical even when the covariance matrix is not diagonal (the ellipses are then rotated), and for dimension n > 2 they become ellipsoids
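For two dimensions the diagonal case can be checked by hand. The following is a sketch of that calculation, writing the diagonal entries as \sigma_1^2 and \sigma_2^2 (notation assumed here, in the spirit of [0]):

```latex
\Sigma = \begin{bmatrix} \sigma_1^2 & 0 \\ 0 & \sigma_2^2 \end{bmatrix},
\qquad |\Sigma| = \sigma_1^2 \sigma_2^2,
\qquad \Sigma^{-1} = \begin{bmatrix} 1/\sigma_1^2 & 0 \\ 0 & 1/\sigma_2^2 \end{bmatrix}

p(x; \mu, \Sigma)
  = \frac{1}{2\pi \sigma_1 \sigma_2}
    \exp\left( -\frac{(x_1 - \mu_1)^2}{2\sigma_1^2} - \frac{(x_2 - \mu_2)^2}{2\sigma_2^2} \right)
  = \frac{1}{\sqrt{2\pi}\,\sigma_1} \exp\left( -\frac{(x_1 - \mu_1)^2}{2\sigma_1^2} \right)
    \cdot
    \frac{1}{\sqrt{2\pi}\,\sigma_2} \exp\left( -\frac{(x_2 - \mu_2)^2}{2\sigma_2^2} \right)
```

The exponent splits into one term per variable because the inverse of a diagonal matrix is diagonal, and the normalization constant splits because the determinant is the product of the diagonal entries.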

g5

Notes and example of bivariate Gaussian

https://github.com/arcarchit/datastories/blob/master/notes/bivariant_gaussian.pdf

The first part of the notes above says that a bivariate distribution can be generated from two independent standard normal variables z ~ N(0,1).

Any given k-variate Gaussian can be represented as a linear combination of k independent standard normal variables, plus the mean. One simple way to find the coefficients is the Cholesky decomposition of the covariance matrix. Theorem 1 below states the same thing.

This has a reference from [1].
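The construction can be tried out numerically. The sketch below draws bivariate samples from two standard normal draws and a Cholesky factor; it uses the closed-form 2x2 factorisation to stay within the standard library, and the target mean, covariance, seed and function names are assumptions for the example.

```python
import math
import random

def cholesky_2x2(sxx, sxy, syy):
    """Lower-triangular L with L L^T = [[sxx, sxy], [sxy, syy]]."""
    l11 = math.sqrt(sxx)
    l21 = sxy / l11
    l22 = math.sqrt(syy - l21 ** 2)
    return l11, l21, l22

def sample_bivariate(mean, cov, n, seed=0):
    """Draw n samples as x = mean + L z, with z made of two N(0, 1) draws."""
    rng = random.Random(seed)
    l11, l21, l22 = cholesky_2x2(*cov)
    samples = []
    for _ in range(n):
        z1, z2 = rng.gauss(0, 1), rng.gauss(0, 1)
        samples.append((mean[0] + l11 * z1, mean[1] + l21 * z1 + l22 * z2))
    return samples

target_mean = (1.0, -2.0)
target_cov = (4.0, 1.2, 1.0)  # (sxx, sxy, syy), chosen to be positive definite
samples = sample_bivariate(target_mean, target_cov, n=20000)

# Empirical mean and covariance should be close to the targets.
m = len(samples)
mx = sum(x for x, _ in samples) / m
my = sum(y for _, y in samples) / m
est_sxy = sum((x - mx) * (y - my) for x, y in samples) / m
print(mx, my, est_sxy)
```

For more than two dimensions a general Cholesky routine from a numerical library would replace `cholesky_2x2`; the sampling step stays the same.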

Linear Transformation Interpretation

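Theorem 1 (as stated in [0]): Let X ~ N(μ, Σ) for some μ ∈ R^n and symmetric positive definite Σ. Then there exists a matrix B ∈ R^(n×n) such that if we define Z = B⁻¹(X − μ), then Z ~ N(0, I).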

This was proved in two steps [0]:

Step-1 : Factorizing covariance matrix

$$\Sigma = B B^T$$

Step-2 : Change of variables, which we apply to density function

$$p_Z(z) = p_X(Bz + \mu)\, |\det B| = \frac{1}{(2\pi)^{n/2}} \exp\left( -\frac{1}{2} z^T z \right)$$
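The core of the change of variables is that the quadratic form in x becomes a plain sum of squares in z. The sketch below checks this for one point in two dimensions, taking B to be the Cholesky factor of Σ (one valid choice of factor, assumed here) and solving z = B⁻¹(x − μ) by forward substitution. The numbers and function names are illustrative.

```python
import math

def cholesky_2x2(sxx, sxy, syy):
    """Lower-triangular B with B B^T = [[sxx, sxy], [sxy, syy]]."""
    b11 = math.sqrt(sxx)
    b21 = sxy / b11
    b22 = math.sqrt(syy - b21 ** 2)
    return b11, b21, b22

def whiten(x, mean, cov):
    """z = B^{-1}(x - mu), via forward substitution since B is lower triangular."""
    b11, b21, b22 = cholesky_2x2(*cov)
    dx, dy = x[0] - mean[0], x[1] - mean[1]
    z1 = dx / b11
    z2 = (dy - b21 * z1) / b22
    return z1, z2

def quadratic_form(x, mean, cov):
    """(x - mu)^T Sigma^{-1} (x - mu) using the closed-form 2x2 inverse."""
    sxx, sxy, syy = cov
    dx, dy = x[0] - mean[0], x[1] - mean[1]
    det = sxx * syy - sxy ** 2
    return (syy * dx * dx - 2 * sxy * dx * dy + sxx * dy * dy) / det

mean = (1.0, -2.0)
cov = (4.0, 1.2, 1.0)  # (sxx, sxy, syy)
x = (3.0, -1.0)

z = whiten(x, mean, cov)
lhs = quadratic_form(x, mean, cov)   # exponent term in the original coordinates
rhs = z[0] ** 2 + z[1] ** 2          # z^T z in the whitened coordinates
print(z, lhs, rhs)
```

Both quantities agree, which is why the density of Z has identity covariance once the Jacobian factor is accounted for.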

On Practical Example

Height, weight and waist size of men in the US (of course weight cannot be negative, so the distribution is only approximately normal)

References

[0] http://cs229.stanford.edu/section/gaussians.pdf

[1] https://www2.stat.duke.edu/courses/Spring12/sta104.1/Lectures/Lec22.pdf