Simple Linear Program

April 14, 2018April 14, 2018Archit Vora Leave a comment

Purpose is to know how it looks solving linear programs using libraries and solvers.

Reference : https://pythonhosted.org/PuLP/CaseStudies/a_blending_problem.html

	\* Minimize food cost *\
	Minimize
	Total_Cost: 0.008 Beef_percentage + 0.013 chicken_percentag
	Subject To
	Fat_Constraint: 0.1 Beef_percentage + 0.08 chicken_percentag >= 6
	Fiber_Constrint: 0.005 Beef_percentage + 0.001 chicken_percentag <= 2
	Protine_Constraint: 0.2 Beef_percentage + 0.1 chicken_percentag >= 8
	Salt_constraint: 0.005 Beef_percentage + 0.002 chicken_percentag <= 0.4
	Total_Sum: Beef_percentage + chicken_percentag = 100
	End

view raw

minimize_food_cost.lp

hosted with ❤ by GitHub

	"""
	THis is taken from https://pythonhosted.org/PuLP/CaseStudies/a_blending_problem.html
	"""

	from pulp import *

	problem = LpProblem("Minimize food cost", LpMinimize)

	x1 = LpVariable("chicken_percentag", lowBound=0)
	x2 = LpVariable("Beef_percentage", lowBound=0, upBound=None)

	problem += 0.013x1 + 0.008x2, "Total Cost"

	problem += x1 + x2 == 100, "Total Sum"
	problem += 0.1x1 + 0.2x2 >=8, "Protine Constraint"
	problem += 0.08x1 + 0.1x2 >=6, "Fat Constraint"
	problem += 0.001x1 + 0.005x2 <=2, "Fiber Constrint"
	problem += 0.002x1 + 0.005x2 <=0.4, "Salt constraint"


	problem.writeLP("minimize_food_cost.lp")

	problem.solve()


	print "Solution found was :", LpStatus[problem.status]

	print "Here is the proportional of chick and beef that should be used"
	for v in problem.variables():
	print "\t",v.name, " : ", v.varValue


	print "Cost will be : ", value(problem.objective)

view raw

pulp_LP.py

hosted with ❤ by GitHub

	Solution found was : Optimal
	Here is the proportional of chick and beef that should be used
	Beef_percentage : 66.666667
	chicken_percentag : 33.333333
	Cost will be : 0.966666665

view raw

std_output.txt

hosted with ❤ by GitHub

ARIMA model

April 3, 2018April 3, 2018Archit Vora Leave a comment

ARMA to ARIMA

When there is a trend in data we take differences
ARIMA – Auto regressive Integrated Moving average
Integrated term includes order of difference, In the example below it is d=2

arima

Below is the sample github gist and output pdf is avaialble at ARIMA model.pdf

	—
	title: "ARIMA model"
	author: "Archit Vora"
	date: "April 3, 2018"
	output: html_document
	—

	“`{r setup, include=FALSE}
	knitr::opts_chunk$set(echo = TRUE)
	“`

	We will try to fit ARIMA model to BJsales dataset from 'dataset' package in R.
	Here is the plot of it.

	“`{r}
	library(datasets)
	plot(BJsales)
	“`

	Mean in changing over time and seems time series is not statioary. Let's take the difference.

	“`{r}
	plot(diff(BJsales))
	“`

	Seems It is still not staionary, let's take one more diff.

	“`{r}
	plot(diff(diff(BJsales)))
	“`
	Now it seems stationary. Let's plot acf and pacf of this doubly differenced series.

	“`{r}
	focus<- diff(diff(BJsales))
	acf(focus)
	pacf(focus)
	“`

	From ACF lag 1, 8 and 11 seems significant, while from PACF seems lag 1, 2, 3, 10 and 19 seems significant.

	Keeping parsimonious principle in mind we shall consider order of 0 and 1 for MA terms and oreder of (0, 1, 2, 3) for AR terms.

	Now let's try differenct models and check their AIC.

	“`{r}
	d <- 2
	for (p in 0:3){
	for (q in 0:1){
	if(p+d+q<6){ # Ensures simple model
	mm<- arima(x = BJsales, order = c(p, d, q))
	pval <- Box.test(mm$residuals, lag = log(length(mm$residuals)))
	sse <- sum(mm$residuals^2)
	aic <- mm$aic
	cat(p, d, q, "AIC = ", aic, "SSE = ", sse, "P-value = ", pval$p.value, "\n")
	}
	}
	}

	“`

	Seems like (0, 2, 1) has smallest AIC but does not have significant p-value.
	Let's leave it for this post and plots residual.

	“`{r}
	model <- arima(x = BJsales, order = c(0, 2, 1))
	par(mfrow = c(2, 2))
	plot(model$residuals)
	acf(model$residuals)
	pacf(model$residuals)
	qqnorm(model$residuals)
	“`
	There seems no significant correlation as well as QQ-plot also looks okay.
	Now let's plot the forecast.

	“`{r}
	library(forecast)
	fc <- forecast(model, h=20)
	plot(fc)
	“`

	All this can also be done by auto.arima routine of forecast package.
	“`{r}
	model <- auto.arima(BJsales, seasonal = FALSE)
	model
	fc <- forecast(model, h = 20)
	plot(fc)
	“`


	auto.arima has come up with different model, but it is okay.

view raw

arima.Rmd

hosted with ❤ by GitHub

Fitting AR Processes

March 31, 2018April 3, 2018Archit Vora Leave a comment

Yule Walker Equation in Matrix Form

ym1

If we write and above equation for k=1, 2, . . ., n and use the fact that ρ(k) = ρ(-k), we can write it in a matrix form.
Using the data we have we can estimate values of ρ (auto correlation coefficients)
acf() routine in R gives us that
Using values of ρ we can then estimate values of Φ (parameters of AR process)

ym2

Above is an example for AR process
We can solve these equation for values of Φ1, Φ2 and Φ3

Reference:

https://www.coursera.org/learn/practical-time-series-analysis/home/week/4

Moving average and Auto-regressive Processes

March 24, 2018July 26, 2018Archit Vora 1 Comment

Moving Average Processes MA(q)

Stock price depends on announcements of last two days
Auto correlation function cuts off at q

maq

Auto regressive Processes AR(p)

Below are the plots for AR(2) process
Depending upon the value of phi1 and phi2 ACF has alternative positive and negative values

Writing AR(p) process as MA process by substituting values of X(t-1). And yes phi is constant, we don’t need phi1, phi2 anymore.

Mean, variance and auto-correlation of AR(p) process, we have assumed Z = Norm(0, sigma2)

ACF of AR-p using Yule-Walker Equation

It is a method of solving difference equation in recursive relation
We first obtained auxiliary equation (also known as characteristic equation) which is polynomial and find root of that
Using these root we get weighted geometric series and find weights using some initial condition
We had learned in mathematics that this way of solving difference equation also related to solving differential equations
In the course they had solved it for Fibonacci series and root had come out to be golden ratio
For AR(p) ACF comes out to be difference equation, solving which can give us ACF for different values of lag

Reference

https://www.coursera.org/learn/practical-time-series-analysis/home/welcome

https://en.wikipedia.org/wiki/Autoregressive%E2%80%93moving-average_model

Stationarity Conditions for MA(q) and AR(p) Processes

March 19, 2018April 3, 2018Archit Vora 1 Comment

Sequence and Series

Convergent Sequence

1/2, 2/3, 3/4, . . . , n/(n+1)

Divergent Sequence

3, 9, 27, . . . . , 3^n

Series => Partial Sum of sequence

Convergent Series => if sum converges

Convergence Test

Integral Test
Comparison Test
Limit comparison test
Alternating Series Test
Ratio test
Root test

Geometric Series

a, ar, ar^2, . . . , ar^n
Convergent if r < 1

Representing function as (geometric) series

seriesRepresntation

Backward shift operator

B^kX(t) = X(t – k)

backOp

Invertibility

Two models have same ACF
Given ACF how to find out the model
We will go for model that is invertible
We can invert MA(1) into AR(∞)
Inverting is basically act of expanding function in geometric series
It is possible when growth r<1
Out of two models only one satisfies this condition
We will select that model given ACF

Conditions for Invertibility[MA(q)] and Stationarity [AR(p)]

dual

How to check if series is both invertible and stationary

Check AR(p) polynomial for stationarity
Check MA(q) polynomial for invertibility
Both should hold

Reference

https://www.coursera.org/learn/practical-time-series-analysis/exam/ITocA/series-backward-shift-operator-invertibility-and-duality

[Time Series] Correlation and Stationarity

March 18, 2018April 3, 2018Archit Vora Leave a comment

Co-variance vs Correlation

Correlation is co-variance divided by standard deviation of both variables
Hence it is independent of units and is always between -1 and 1, which makes comparison easier
Formula on the right is time series specific
- It is auto correlation coefficient at lag k
- It is define as ration of auto-correlation at lag k divide by auto-correlation at lag 0
- This values are plotted on correlogram (See one for MA(2) process below)

acf

Stationary Time Series

No systematic change in mean (No trend)
No systematic change in Variance
No periodic variation (Seasonality)

If time series is not stationary we apply several transformation to make it stationary.

For example applying difference operator to random walk makes it stationary.

Random Walk

Previous value of noise
If first value is zero then current value is summation of all the noises so far
X(t) = X(t-1) + Z(t)
Z(t) = Normal (mu, sigma2)
if X(0) = 0 then X(t) = sum(Z(k)) k form 0 to t
Expectation[X(t)] = t*mu – – Changes with time
Variance[X(t)] = t*sigma2 – – Changes with time
Not a stationary process
let Y(t) = X(t) – X(t-1) = Z(t) – – Y(t) is a stationary process

Example of Stationary Process

Moving average and Auto regressive processes described here can be stationary under conditions described here.

References

Iterative Method for Unconstrained Optimization

March 11, 2018August 29, 2020Archit Vora 1 Comment

There are two things

Newton method for finding roots
1. Does not have second order derivative in denominator
Using 1 to solve for minima problem
1. Minima problem is same as finding root of derivative
2. Has second order derivative in denominator
There is a nice derivation starting from taylor series [3]

Newton’s Method

Based on Taylor series expansion
Advantages
- Convergence is rapid in general, quadratic near optimal point
- Insensitive to no of variables
- Performance does not depend on choice of parameter
  - Gradient method depends on learning rate
Disadvantages
- Cost of computing and storing Hessian
- Cost of computing single newton step
  - You need double derivative (Example in note is a simple root finding problem)

Gradient Descent

Very popular method and does not need any write up
Exhibits approximately linear convergence
Advantage
- Very simple to implement
Disadvantage
- Convergence rate depends on number of the Hessian
- Very slow when for large no of variables (say 1000 or more)
- Performance depends on choice of parameters like learning rate

Gradient Descent vs Netwon’s Method

Edit 1 : I believe term step is more suitable here. Gradient step and Newton step. Method is more suitable for interior point methods, active set methods, cutting plane methods and proximal methods.

Newton’s method for finding roots:

In optimization we are essentially finding roots of derivative.

We are approximating function using Taylor series. We are finding minimum of this approximated function. This root will be our new guess. This perspective is used in derivation.

Andrew N.G Lecture [2]

Gradient Descent	Newton
Simpler	Slightly more complex (Requires computing and inverting hessian)
Needs choice of learning rate alpha	No parameters (third point in image is optional )
Needs more iteration	Needs fewer iteration
Each iteration is cheaper O(n) where n is no of features	Each iteration is costly. Hessian is (n+1) * (n+1). Inverting matrix is roughly O(n^3)
Use when no of features are less (n<1000)	Use when (n > 10,000)

Gradient can only give you direction. If you want to know next point (or step size), second order derivative (hessian) is must.

Golden Section Search

Typically applicable for one dimension only
We used it to calculate mobile/tablet adjustments
- As a good practice we had avoided recursion and took at max 20 iteration breaking loop with some criterion
Applicable for strictly unimodal function
Three points that maintain golden ratio (phi)
Following two problems are different :
- Searching in rotated sorted array
- Finding a max on uni-modal function
- Formal has a property that left one end will be smaller than other, which later does not have
  - This property is used in binary search
What is special about golden ration ?
- https://www.quora.com/What-is-so-special-about-the-golden-ratio

Bisection method is okay to find root, but for finding extreme golden section is preferred
- http://www.aip.de/groups/soe/local/numres/bookcpdf/c10-1.pdf
Sample code :

	from math import sqrt
	phi = (1 + sqrt(5))/2
	resphi = 2 – phi

	# a and b are the current bounds; the minimum is between them.
	# c is the center pointer pushed slightly left towards a
	def goldenSectionSearch(f, a, c, b, absolutePrecision):
	if abs(a – b) < absolutePrecision:
	return (a + b)/2
	# Create a new possible center, in the area between c and b, pushed against c
	d = c + resphi*(b – c)
	if f(d) < f(c):
	return goldenSectionSearch(f, c, d, b, absolutePrecision)
	else:
	return goldenSectionSearch(f, d, c, a, absolutePrecision)

	f = lambda x: x**2

	def test_search():
	assert abs(goldenSectionSearch(f, -1, (-1 + resphi*2), 1, 1e-10)) < 1e-10
	print "OK!"

	if __name__ == '__main__':
	test_search()

view raw goldensection.py hosted with ❤ by GitHub

Reference

[0] http://www.ece.mcmaster.ca/~xwu/part4.pdf

*[1] https://www.youtube.com/watch?v=42zJ5xrdOqo

*[2] https://www.youtube.com/watch?v=iwO0JPt59YQ

[3] https://suzyahyah.github.io/calculus/optimization/2018/04/06/Taylor-Series-Newtons-Method.html

Time series week 1

March 10, 2018March 30, 2018Archit Vora Leave a comment

Plotting in R
Linear regression properly fitted or not
- Residue are important thing to observed
- Q-Q plots for normality test
- Residues over time
  - Zoomed in residues over time
Hypothesis test
- One, two sided t test
- Confidence interval
  - Where we think mean lies
  - If it dose not contain 0 we tend to reject null hypothesis (Very broad statement, but I think you got the concept)
Correlation function
- Which quarter data false

Ref : https://www.coursera.org/learn/practical-time-series-analysis/home/welcome

On Clustering

February 1, 2018November 14, 2018Archit Vora Leave a comment

K-mean is probably most popular algorithm and most taught algorithms in academia. However it has got many limitation and listing some of them here:

You need to specify value of k
Can cluster non-clustered data
Sensitive to scale
Even on perfect data sets, it can get stuck in a local minimum
Means are continuous
Hidden assumption: SSE is worth minimizing
K-means serves more as quantification

In Hierarchical clustering you don’t need to specify values of k, you can sample any level from the tree it build either by top down or bottom up approach. Such a tree is called Dendrogram.

Scikit also supports variety of clustering algorithms including DBSCAN and lists which one suits when. http://scikit-learn.org/stable/modules/clustering.html

References:

https://stats.stackexchange.com/questions/133656/how-to-understand-the-drawbacks-of-k-means

No Free Lunch (NFL)

February 1, 2018May 18, 2023Archit Vora Leave a comment

When averaged across all possible situations, every algorithm performs equally well 🙂

Imagine you have two different classification problems: one involves identifying fraudulent credit card transactions, and the other involves diagnosing medical conditions based on patient symptoms. These two problems have distinct characteristics and requirements.

Now, let’s say you have Algorithm A, which performs exceptionally well on the credit card fraud detection problem. It has been extensively optimized and fine-tuned for this specific problem domain, resulting in high accuracy and low false positive rates.

On the other hand, you have Algorithm B, which excels in diagnosing medical conditions. It has been trained on a vast dataset of patient records and has been shown to provide accurate diagnoses with minimal errors.

If you were to switch the algorithms and apply Algorithm A to the medical diagnosis problem and Algorithm B to the credit card fraud detection problem, you may encounter subpar performance. Algorithm A, which was designed specifically for fraud detection, might struggle to correctly identify medical conditions based on symptoms. Similarly, Algorithm B, which was trained on medical data, may not be effective at detecting credit card fraud.

This example illustrates the essence of the No Free Lunch theorem. While Algorithm A and Algorithm B excel in their respective problem domains, they do not have universal superiority. Each algorithm is designed and optimized for a specific problem, taking into account its unique characteristics and requirements.

The NFL theorem reminds us that when approaching a new problem, it is essential to carefully analyze its specific characteristics, data distribution, and requirements. It is necessary to evaluate and select an algorithm or approach that aligns well with those specific factors, rather than assuming that a single algorithm can provide optimal performance across all problem domains.

In the context of the No Free Lunch (NFL) theorem, an algorithm refers to a specific computational method or procedure used to solve a problem or perform a task. Algorithms can be mathematical, statistical, heuristic, or machine learning-based, among others. Example being Logistic regression, CNN, RNN etc.

In summary, the No Free Lunch theorem emphasizes the importance of considering problem-specific factors and carefully selecting algorithms or approaches that suit the unique requirements of each problem. Different problems may require different algorithms, and there is no universally superior algorithm that can perform optimally for all possible problems.

http://www.no-free-lunch.org/

http://scholar.google.com/scholar?q=”no+free+lunch”+Wolpert”

Data Stories

Page 7 of 10

Simple Linear Program

ARIMA model

Fitting AR Processes

Yule Walker Equation in Matrix Form

Moving average and Auto-regressive Processes

Moving Average Processes MA(q)

Auto regressive Processes AR(p)

ACF of AR-p using Yule-Walker Equation

Reference

Stationarity Conditions for MA(q) and AR(p) Processes

Sequence and Series

Backward shift operator

Invertibility

Conditions for Invertibility[MA(q)] and Stationarity [AR(p)]

How to check if series is both invertible and stationary

Reference

[Time Series] Correlation and Stationarity

Co-variance vs Correlation

Stationary Time Series

Random Walk

References

Further reading

Iterative Method for Unconstrained Optimization

Newton’s Method

Gradient Descent

Gradient Descent vs Netwon’s Method

Golden Section Search

Reference

Time series week 1

On Clustering

No Free Lunch (NFL)