Simple Linear Program

Purpose is to know how it looks solving linear programs using libraries and solvers.

Reference : https://pythonhosted.org/PuLP/CaseStudies/a_blending_problem.html

 


\* Minimize food cost *\
Minimize
Total_Cost: 0.008 Beef_percentage + 0.013 chicken_percentag
Subject To
Fat_Constraint: 0.1 Beef_percentage + 0.08 chicken_percentag >= 6
Fiber_Constrint: 0.005 Beef_percentage + 0.001 chicken_percentag <= 2
Protine_Constraint: 0.2 Beef_percentage + 0.1 chicken_percentag >= 8
Salt_constraint: 0.005 Beef_percentage + 0.002 chicken_percentag <= 0.4
Total_Sum: Beef_percentage + chicken_percentag = 100
End


"""
THis is taken from https://pythonhosted.org/PuLP/CaseStudies/a_blending_problem.html
"""
from pulp import *
problem = LpProblem("Minimize food cost", LpMinimize)
x1 = LpVariable("chicken_percentag", lowBound=0)
x2 = LpVariable("Beef_percentage", lowBound=0, upBound=None)
problem += 0.013*x1 + 0.008*x2, "Total Cost"
problem += x1 + x2 == 100, "Total Sum"
problem += 0.1*x1 + 0.2*x2 >=8, "Protine Constraint"
problem += 0.08*x1 + 0.1*x2 >=6, "Fat Constraint"
problem += 0.001*x1 + 0.005*x2 <=2, "Fiber Constrint"
problem += 0.002*x1 + 0.005*x2 <=0.4, "Salt constraint"
problem.writeLP("minimize_food_cost.lp")
problem.solve()
print "Solution found was :", LpStatus[problem.status]
print "Here is the proportional of chick and beef that should be used"
for v in problem.variables():
print "\t",v.name, " : ", v.varValue
print "Cost will be : ", value(problem.objective)

view raw

pulp_LP.py

hosted with ❤ by GitHub


Solution found was : Optimal
Here is the proportional of chick and beef that should be used
Beef_percentage : 66.666667
chicken_percentag : 33.333333
Cost will be : 0.966666665

view raw

std_output.txt

hosted with ❤ by GitHub

 

 

ARIMA model

ARMA to ARIMA

  • When there is a trend in data we take differences
  • ARIMA – Auto regressive Integrated Moving average
  • Integrated term includes order of difference, In the example below it is d=2

arima

Below is the sample github gist and output pdf is avaialble at ARIMA model.pdf


title: "ARIMA model"
author: "Archit Vora"
date: "April 3, 2018"
output: html_document
“`{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
“`
We will try to fit ARIMA model to BJsales dataset from 'dataset' package in R.
Here is the plot of it.
“`{r}
library(datasets)
plot(BJsales)
“`
Mean in changing over time and seems time series is not statioary. Let's take the difference.
“`{r}
plot(diff(BJsales))
“`
Seems It is still not staionary, let's take one more diff.
“`{r}
plot(diff(diff(BJsales)))
“`
Now it seems stationary. Let's plot acf and pacf of this doubly differenced series.
“`{r}
focus<- diff(diff(BJsales))
acf(focus)
pacf(focus)
“`
From ACF lag 1, 8 and 11 seems significant, while from PACF seems lag 1, 2, 3, 10 and 19 seems significant.
Keeping parsimonious principle in mind we shall consider order of 0 and 1 for MA terms and oreder of (0, 1, 2, 3) for AR terms.
Now let's try differenct models and check their AIC.
“`{r}
d <- 2
for (p in 0:3){
for (q in 0:1){
if(p+d+q<6){ # Ensures simple model
mm<- arima(x = BJsales, order = c(p, d, q))
pval <- Box.test(mm$residuals, lag = log(length(mm$residuals)))
sse <- sum(mm$residuals^2)
aic <- mm$aic
cat(p, d, q, "AIC = ", aic, "SSE = ", sse, "P-value = ", pval$p.value, "\n")
}
}
}
“`
Seems like (0, 2, 1) has smallest AIC but does not have significant p-value.
Let's leave it for this post and plots residual.
“`{r}
model <- arima(x = BJsales, order = c(0, 2, 1))
par(mfrow = c(2, 2))
plot(model$residuals)
acf(model$residuals)
pacf(model$residuals)
qqnorm(model$residuals)
“`
There seems no significant correlation as well as QQ-plot also looks okay.
Now let's plot the forecast.
“`{r}
library(forecast)
fc <- forecast(model, h=20)
plot(fc)
“`
All this can also be done by auto.arima routine of forecast package.
“`{r}
model <- auto.arima(BJsales, seasonal = FALSE)
model
fc <- forecast(model, h = 20)
plot(fc)
“`
auto.arima has come up with different model, but it is okay.

view raw

arima.Rmd

hosted with ❤ by GitHub

Fitting AR Processes

Yule Walker Equation in Matrix Form

 

ym1

  • If we write and above equation for k=1, 2, . . ., n and use the fact that ρ(k) = ρ(-k), we can write it in a matrix form.
  • Using the data we have we can estimate values of ρ  (auto correlation coefficients)
  • acf() routine in R gives us that
  • Using values of ρ we can then estimate values of Φ (parameters of AR process)

 

ym2

  • Above is an example for AR process
  • We can solve these equation for values of Φ1, Φ2 and Φ3

 

Reference:

Moving average and Auto-regressive Processes

Moving Average Processes MA(q)

  • Stock price depends on announcements of last two days
  • Auto correlation function cuts off at q

maq

 

Auto regressive Processes AR(p)

 

1

2

3

 

  • Below are the plots for AR(2) process
  • Depending upon the value of phi1 and phi2 ACF has alternative positive and negative values

 

56

 

Writing AR(p) process as MA process by substituting values of X(t-1). And yes phi is constant, we don’t need phi1, phi2 anymore.

78

 

Mean, variance and auto-correlation of AR(p) process, we have assumed Z = Norm(0, sigma2)

9

 

 

ACF of AR-p using Yule-Walker Equation

  • It is a method of solving difference equation in recursive relation
  • We first obtained auxiliary equation (also known as characteristic equation) which is polynomial and find root of that
  • Using these root we get weighted geometric series and find weights using some initial condition
  • We had learned in mathematics that this way of solving difference equation also related to solving differential equations
  • In the course they had solved it for Fibonacci series and root had come out to be golden ratio
  • For AR(p) ACF comes out to be difference equation, solving which can give us ACF for different values of lag

 

 

Reference

https://www.coursera.org/learn/practical-time-series-analysis/home/welcome

https://en.wikipedia.org/wiki/Autoregressive%E2%80%93moving-average_model

 

 

 

 

 

Stationarity Conditions for MA(q) and AR(p) Processes

Sequence and Series

Convergent Sequence

1/2, 2/3, 3/4, . . . , n/(n+1)

Divergent Sequence

3, 9, 27, . . . . , 3^n

Series => Partial Sum of sequence

Convergent Series => if sum converges

Convergence Test

  • Integral Test
  • Comparison Test
  • Limit comparison test
  • Alternating Series Test
  • Ratio test
  • Root test
Geometric Series

  • a, ar, ar^2, . . . , ar^n
  • Convergent if r < 1
Representing function as (geometric) series

seriesRepresntation

Backward shift operator

  • B^kX(t) = X(t – k)

backOp

Invertibility

  • Two models have same ACF
  • Given ACF how to find out the model
  • We will go for model that is invertible
  • We can invert MA(1) into AR(∞)
  • Inverting is basically act of expanding function in geometric series
  • It is possible when growth r<1
  • Out of two models only one satisfies this condition
  • We will select that model given ACF

Conditions for Invertibility[MA(q)] and Stationarity [AR(p)]

dual

How to check if series is both invertible and stationary

  • Check AR(p) polynomial for stationarity
  • Check MA(q) polynomial for invertibility
  • Both should hold

Reference

https://www.coursera.org/learn/practical-time-series-analysis/exam/ITocA/series-backward-shift-operator-invertibility-and-duality

 

[Time Series] Correlation and Stationarity

Co-variance vs Correlation

  • Correlation is co-variance divided by standard deviation of both variables
  • Hence it is independent of units and is always between -1 and 1, which makes comparison easier
  • Formula on the right is time series specific
    • It is auto correlation coefficient at lag k
    • It is define as ration of auto-correlation at lag k divide by auto-correlation at lag 0
    • This values are plotted on correlogram  (See one for MA(2) process below)

acf

 

Stationary Time Series

  • No systematic change in mean (No trend)
  • No systematic change in Variance
  • No periodic variation (Seasonality)

If time series is not stationary we apply several transformation to make it stationary.

For example applying difference operator to random walk makes it stationary.

 

 

Random Walk

  • Previous value of noise
  • If first value is zero then current value is summation of all the noises so far
  • X(t) = X(t-1) + Z(t)
  • Z(t) = Normal (mu, sigma2)
  • if X(0) = 0 then X(t) = sum(Z(k)) k form 0 to t
  • Expectation[X(t)] = t*mu   – –  Changes with time
  • Variance[X(t)] = t*sigma2   – – Changes with time
  • Not a stationary process
  • let Y(t) = X(t) – X(t-1) = Z(t)  – – Y(t) is a stationary process

 

Example of Stationary Process

Moving average and Auto regressive processes described here can be stationary under conditions described here.

 

 

References

 

Further reading

 

 

 

 

Iterative Method for Unconstrained Optimization

There are two things

  1. Newton method for finding roots
    1. Does not have second order derivative in denominator
  2. Using 1 to solve for minima problem
    1. Minima problem is same as finding root of derivative  
    2. Has second order derivative in denominator
  3. There is a nice derivation starting from taylor series [3]

Newton’s Method

newton
  • Based on Taylor series expansion
  • Advantages
    • Convergence is rapid in general, quadratic near optimal point
    • Insensitive to no of variables
    • Performance does not depend on choice of parameter
      • Gradient method depends on learning rate
  • Disadvantages
    • Cost of computing and storing Hessian
    • Cost of computing single newton step
      • You need double derivative (Example in note is a simple root finding problem)

Gradient Descent

  • Very popular method and does not need any write up
  • Exhibits approximately linear convergence
  • Advantage
    • Very simple to implement
  • Disadvantage
    • Convergence rate depends on number of the Hessian
    • Very slow when for large no of variables (say 1000 or more)
    • Performance depends on choice of parameters like learning rate

 

Gradient Descent vs Netwon’s Method

Edit 1 : I believe term step is more suitable here. Gradient step and Newton step. Method is more suitable for interior point methods, active set methods, cutting plane methods and proximal methods.

Newton’s method for finding roots:

3

In optimization we are essentially finding roots of derivative.

1
2

We are approximating function using Taylor series. We are finding minimum of this approximated function. This root will be our new guess. This perspective is used in derivation.

4

Andrew N.G Lecture [2]

 Gradient Descent Newton
 Simpler Slightly more complex (Requires computing and inverting hessian)
 Needs choice of learning rate alpha No parameters (third point in image is optional )
 Needs more iteration Needs fewer iteration
 Each iteration is cheaper O(n) where n is no of features Each iteration is costly. Hessian is (n+1) * (n+1). Inverting matrix is roughly O(n^3)
 Use when no of features are less (n<1000) Use when (n > 10,000)

Gradient can only give you direction. If you want to know next point (or step size), second order derivative (hessian) is must.

Golden Section Search

  • Typically applicable for one dimension only
  • We used it to calculate mobile/tablet adjustments
    • As a good practice we had avoided recursion and took at max 20 iteration breaking loop with some criterion
  • Applicable for strictly unimodal function
  • Three points that maintain golden ratio (phi)
  • Following two problems are different :
    • Searching in rotated sorted array
    • Finding a max on uni-modal function
    • Formal has a property that left one end will be smaller than other, which later does not have
      • This property is used in binary search
  • What is special about golden ration ?
from math import sqrt
phi = (1 + sqrt(5))/2
resphi = 2 – phi
# a and b are the current bounds; the minimum is between them.
# c is the center pointer pushed slightly left towards a
def goldenSectionSearch(f, a, c, b, absolutePrecision):
if abs(a – b) < absolutePrecision:
return (a + b)/2
# Create a new possible center, in the area between c and b, pushed against c
d = c + resphi*(b – c)
if f(d) < f(c):
return goldenSectionSearch(f, c, d, b, absolutePrecision)
else:
return goldenSectionSearch(f, d, c, a, absolutePrecision)
f = lambda x: x**2
def test_search():
assert abs(goldenSectionSearch(f, -1, (-1 + resphi*2), 1, 1e-10)) < 1e-10
print "OK!"
if __name__ == '__main__':
test_search()

Reference

[0] http://www.ece.mcmaster.ca/~xwu/part4.pdf

*[1] https://www.youtube.com/watch?v=42zJ5xrdOqo

*[2] https://www.youtube.com/watch?v=iwO0JPt59YQ

[3] https://suzyahyah.github.io/calculus/optimization/2018/04/06/Taylor-Series-Newtons-Method.html

Time series week 1

  • Plotting in R
  • Linear regression properly fitted or not
    • Residue are important thing to observed
    • Q-Q plots for normality test
    • Residues over time
      • Zoomed in residues over time
  • Hypothesis test
    • One, two sided t test
    • Confidence interval
      • Where we think mean lies
      • If it dose not contain 0 we tend to reject null hypothesis (Very broad statement, but I think you got the concept)
  • Correlation function
    • Which quarter data false

 

Ref : https://www.coursera.org/learn/practical-time-series-analysis/home/welcome

 

 

 

On Clustering

K-mean is probably most popular algorithm and most taught algorithms in academia. However it has got many limitation and listing some of them here:

  • You need to specify value of k
  • Can cluster non-clustered data
  • Sensitive to scale
  • Even on perfect data sets, it can get stuck in a local minimum
  • Means are continuous
  • Hidden assumption: SSE is worth minimizing
  • K-means serves more as quantification

 

In Hierarchical clustering you don’t need to specify values of k, you can sample any level from the tree it build either by top down or bottom up approach. Such a tree is called Dendrogram.

Scikit also supports variety of clustering algorithms including DBSCAN and lists which one suits when. http://scikit-learn.org/stable/modules/clustering.html

 

References:

https://stats.stackexchange.com/questions/133656/how-to-understand-the-drawbacks-of-k-means

 

No Free Lunch (NFL)

When averaged across all possible situations, every algorithm performs equally well 🙂

Imagine you have two different classification problems: one involves identifying fraudulent credit card transactions, and the other involves diagnosing medical conditions based on patient symptoms. These two problems have distinct characteristics and requirements.

Now, let’s say you have Algorithm A, which performs exceptionally well on the credit card fraud detection problem. It has been extensively optimized and fine-tuned for this specific problem domain, resulting in high accuracy and low false positive rates.

On the other hand, you have Algorithm B, which excels in diagnosing medical conditions. It has been trained on a vast dataset of patient records and has been shown to provide accurate diagnoses with minimal errors.

If you were to switch the algorithms and apply Algorithm A to the medical diagnosis problem and Algorithm B to the credit card fraud detection problem, you may encounter subpar performance. Algorithm A, which was designed specifically for fraud detection, might struggle to correctly identify medical conditions based on symptoms. Similarly, Algorithm B, which was trained on medical data, may not be effective at detecting credit card fraud.

This example illustrates the essence of the No Free Lunch theorem. While Algorithm A and Algorithm B excel in their respective problem domains, they do not have universal superiority. Each algorithm is designed and optimized for a specific problem, taking into account its unique characteristics and requirements.

The NFL theorem reminds us that when approaching a new problem, it is essential to carefully analyze its specific characteristics, data distribution, and requirements. It is necessary to evaluate and select an algorithm or approach that aligns well with those specific factors, rather than assuming that a single algorithm can provide optimal performance across all problem domains.

In the context of the No Free Lunch (NFL) theorem, an algorithm refers to a specific computational method or procedure used to solve a problem or perform a task. Algorithms can be mathematical, statistical, heuristic, or machine learning-based, among others. Example being Logistic regression, CNN, RNN etc.

In summary, the No Free Lunch theorem emphasizes the importance of considering problem-specific factors and carefully selecting algorithms or approaches that suit the unique requirements of each problem. Different problems may require different algorithms, and there is no universally superior algorithm that can perform optimally for all possible problems.

http://www.no-free-lunch.org/

http://scholar.google.com/scholar?q=”no+free+lunch”+Wolpert&#8221;