Negative sampling in word2vec

Posted January 31, 2018 by Archit Vora

In the previous post we talked about the skip-gram model.

https://datastoriesweb.wordpress.com/2018/01/31/word2vec-and-skip-gram-model/

Now let’s say we have 1000 words and 300 hidden units. We then have 300,000 weights in each of the hidden and output layers, which is too many parameters to update for every training sample.

The output label during training is a one-hot vector with 999 zeros and a single 1. We randomly select 5 of the zeros and update the weights for six words only (the 5 zeros and the single 1). The more frequent a word is, the higher the probability of it being selected. The Google paper gives an empirical formula for this (the unigram frequency raised to the 3/4 power). This is known as negative sampling.

The above was for the output layer. In the hidden layer, weights are updated only for the input word (irrespective of whether negative sampling is used or not).
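The selection step can be sketched roughly as follows. This is a minimal illustration, not the original word2vec code: the word counts and the function names `negative_sampling_probs` and `sample_negatives` are hypothetical, and drawing distinct negatives is a simplification made here.

```python
import random

def negative_sampling_probs(counts, power=0.75):
    """Return P(word) proportional to count ** power (0.75 as in the paper)."""
    weighted = {w: c ** power for w, c in counts.items()}
    total = sum(weighted.values())
    return {w: v / total for w, v in weighted.items()}

def sample_negatives(counts, positive, k=5, rng=random):
    """Draw k distinct negative words, never the positive (true) word."""
    probs = negative_sampling_probs(counts)
    candidates = [w for w in probs if w != positive]
    weights = [probs[w] for w in candidates]
    k = min(k, len(candidates))  # avoid looping forever on tiny vocabularies
    negatives = set()
    while len(negatives) < k:
        negatives.add(rng.choices(candidates, weights=weights, k=1)[0])
    return sorted(negatives)

# Hypothetical corpus counts, for illustration only.
counts = {"the": 1000, "on": 800, "cat": 50, "dog": 45, "sat": 40, "mat": 30, "ran": 20}
probs = negative_sampling_probs(counts)
negatives = sample_negatives(counts, positive="cat", k=5, rng=random.Random(0))
```

Only the output weights of the positive word and the sampled negatives would then be updated for this training pair.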

Reference:

http://mccormickml.com/2017/01/11/word2vec-tutorial-part-2-negative-sampling/

Word2Vec and skip gram model

Posted January 31, 2018 (updated July 21, 2021) by Archit Vora

Skip gram model

Weights of the hidden layer serve as word vectors
There is just one hidden layer and one output layer (softmax)
Hidden layer does not have any activation
As the input is a one-hot vector, the target output is also a one-hot vector
- The output of the hidden layer is the corresponding word vector
In the below diagram:
- Size of input (1 x 10000)
- Size of output (1 x 10000)
- Weight of hidden layer (10000 x 300)
- Weight of output layer (300 x 10000)
- So there are too many weights to learn; solution: negative sampling
Training pairs are nearby words within a predefined window
- We can imagine how huge that can be
- Each pair consists of two words, both one-hot encoded
- We need to know the size of our vocabulary in advance (it is the dimension of the one-hot vector)
The paper Google released was trained on Google News data and used 300-dimensional vectors, which means 300 neurons in the hidden layer. The paper lists this number along with the size of the training set and the efficiency.
- Note there is one more parameter, called window size, which was set to 5.
- It means that the 5 words before and after the center word are paired with it as training data.
There is no activation function on the hidden layer neurons, but the output neurons use softmax.
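To make the training pairs concrete, here is a minimal sketch of how (center, context) pairs could be generated from a tokenized sentence. The helper name `skipgram_pairs` and the example sentence are hypothetical; a window of 2 is used only to keep the output small.

```python
def skipgram_pairs(tokens, window=5):
    """Return (center, context) pairs for every word within `window` positions."""
    pairs = []
    for i, center in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:  # a word is not paired with itself
                pairs.append((center, tokens[j]))
    return pairs

sentence = "the quick brown fox jumps".split()
pairs = skipgram_pairs(sentence, window=2)
```

Each pair would then be one-hot encoded: the center word as the input and the context word as the target.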

Why word2vec

Earlier NLP methods relied on synonyms/hypernyms, which are not fully contextual
- The earlier approach was mainly one-hot encoding of words
“proficient” is a synonym of “good” only in some contexts
New words are being added every day
All words are one-hot encoded
- Somewhat similar words might be orthogonal
- The size of the vector becomes too large
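The orthogonality point can be checked with a small sketch. The dense vectors below are made-up numbers for illustration, not trained embeddings.

```python
import math

def one_hot(index, size):
    """One-hot vector of length `size` with a 1 at position `index`."""
    return [1.0 if i == index else 0.0 for i in range(size)]

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

# "good" and "great" are similar words, yet their one-hot vectors are orthogonal.
sim_one_hot = cosine(one_hot(0, 3), one_hot(1, 3))

# Hypothetical dense embeddings can place similar words close together.
dense = {"good": [0.9, 0.1], "great": [0.8, 0.2], "banana": [-0.1, 0.95]}
sim_dense = cosine(dense["good"], dense["great"])
sim_unrelated = cosine(dense["good"], dense["banana"])
```

With one-hot encoding every pair of distinct words has similarity 0, so no notion of closeness is available.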

Role of TF-IDF

It is a scoring mechanism
Instead of averaging the vectors of all the words in a document, we can take a weighted average using TF-IDF scores
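A minimal sketch of such a weighted average, assuming the word vectors and TF-IDF scores are already available as dictionaries (the function name `weighted_doc_vector` and all numbers are hypothetical):

```python
def weighted_doc_vector(tokens, word_vectors, tfidf):
    """Average the word vectors of `tokens`, weighting each by its TF-IDF score."""
    dim = len(next(iter(word_vectors.values())))
    total = [0.0] * dim
    weight_sum = 0.0
    for tok in tokens:
        if tok in word_vectors and tok in tfidf:
            w = tfidf[tok]
            total = [t + w * x for t, x in zip(total, word_vectors[tok])]
            weight_sum += w
    if weight_sum == 0.0:
        return total  # no known words: return the zero vector
    return [t / weight_sum for t in total]

word_vectors = {"the": [1.0, 0.0], "cat": [0.0, 1.0]}
tfidf = {"the": 0.1, "cat": 0.9}
doc_vec = weighted_doc_vector(["the", "cat"], word_vectors, tfidf)
```

Here the informative word "cat" dominates the document vector, whereas a plain average would give both words equal influence.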

There are two more things:

Continuous Bag Of Words
Negative sampling

CBOW

It also takes the average of the context words
- One argument in its favour is that averaging word vectors is a valid operation
Both CBOW and skip-gram do not add non-linearity in the hidden layer
- Output layer uses softmax
- The idea is that the word embedding is used to predict the target word.
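The points above can be sketched as a single forward pass. This is an illustrative toy, not a trained model: the weight values are made up, and `W_out` is stored here as one output vector per vocabulary word (the transpose of the 300 x 10000 layout mentioned earlier).

```python
import math

def softmax(scores):
    """Numerically stable softmax."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def cbow_forward(context_ids, W_in, W_out):
    """Average the context embeddings, then score every vocabulary word."""
    dim = len(W_in[0])
    # Hidden layer: plain average of the context word vectors, no activation.
    hidden = [sum(W_in[i][d] for i in context_ids) / len(context_ids)
              for d in range(dim)]
    # Output layer: dot product with each word's output vector, then softmax.
    scores = [sum(h * w for h, w in zip(hidden, out_vec)) for out_vec in W_out]
    return hidden, softmax(scores)

# Toy vocabulary of 3 words with 2-dimensional embeddings (made-up values).
W_in = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
W_out = [[2.0, 0.0], [0.0, 0.0], [1.0, 1.0]]
hidden, probs = cbow_forward([0, 1], W_in, W_out)
```

The softmax output is a probability distribution over the vocabulary for the target word.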

References :

http://mccormickml.com/2016/04/19/word2vec-tutorial-the-skip-gram-model/

http://cs224d.stanford.edu/

http://web.stanford.edu/class/cs224n/syllabus.html

http://nadbordrozd.github.io/blog/2016/05/20/text-classification-with-word2vec/

Quadratic Programming CVXOPT

Posted November 26, 2017 (updated May 8, 2020) by Archit Vora

This is taken from https://courses.csail.mit.edu/6.867/wiki/images/a/a7/Qp-cvxopt.pdf

General algorithms for QP (from wikipedia)

interior point,
active set,
augmented Lagrangian,
conjugate gradient,
gradient projection,
extensions of the simplex algorithm.

Standard form

qp1
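For reference, the standard form accepted by CVXOPT's `solvers.qp`, as given in the CVXOPT documentation (the notation may differ slightly from the original figure), is:

```latex
\begin{aligned}
\min_{x} \quad & \tfrac{1}{2}\, x^{T} P x + q^{T} x \\
\text{subject to} \quad & G x \preceq h, \\
& A x = b
\end{aligned}
```

Here $P$ is a symmetric positive semidefinite matrix, and the solver is called as `solvers.qp(P, q, G, h, A, b)` with all arguments passed as CVXOPT matrices.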

Converting to standard form

qp2

Python Code

Notebook: QP_CVXOPT.ipynb (hosted on GitHub)

Derivation of backpropagation

Posted November 25, 2017 (updated May 10, 2020) by Archit Vora

Quick Summary:

back_prop

Detailed Derivation:

Even if you look at the gradient descent update below, the error is multiplied by the previous value (the input coming into the weight). When an input is higher, its contribution to the error is higher, and its weight needs to change more.
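A minimal sketch of this for a single linear neuron with squared error L = 0.5 * (prediction - target)^2, where the gradient for each weight is the error times that weight's input. The function name and numbers are hypothetical.

```python
def weight_gradients(weights, inputs, target):
    """Gradient of 0.5 * (prediction - target) ** 2 with respect to each weight."""
    prediction = sum(w * x for w, x in zip(weights, inputs))
    error = prediction - target
    # dL/dw_i = error * x_i: the error is scaled by the input feeding that weight.
    return [error * x for x in inputs]

# Two equal weights, but the second input is ten times larger than the first.
grads = weight_gradients(weights=[0.5, 0.5], inputs=[1.0, 10.0], target=0.0)
```

The weight attached to the larger input receives a gradient ten times as large, so it is changed more in the update.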

Derivation Of Backpropagation – 2

Posted November 5, 2017 (updated November 25, 2017) by Archit Vora

References :

Pattern Recognition and Machine Learning by Bishop [Page no 244]
Andrew Ng’s course by deeplearning.ai
https://sudeepraja.github.io/Neural/

SVM Solution Lagrange

Posted November 1, 2017 (updated October 25, 2020) by Archit Vora

Reference :

cs229-notes3.pdf

Dynamic Programming for RL

Posted October 22, 2017 (updated May 7, 2020) by Archit Vora

Dynamic programming is one of the methods for solving reinforcement learning problems. It assumes that the complete dynamics of the MDP are known, and we are interested in:

Finding value function for given policy (Prediction problem)
Finding optimal policy for given MDP (Control problem)

There are three things :

Policy Evaluation
- We calculate the value of a given state
  - The average value, weighted by the probability of each action defined in the policy
- Takes the probability of actions into account
Policy Iteration
- First evaluates the policy
- Then generates a new policy based on this evaluation
- Calculates and updates the maximum possible value of a state
  - Value is maximum when we take the optimal action
Value Iteration
- Policy evaluation is a time-consuming process
- Let’s go with just one iteration of it
- We don’t need to update the policy in each iteration; simply continuing to use the new value function does the job
- So we keep updating the value function (choosing the action that maximizes value instead of averaging over action probabilities) until it converges, and then select a policy based on this optimal value function
- I have coded it at [0]

What is policy ?

A policy is simply the probability of performing an action (a) when in state (s), and our goal is to tune the policy to maximize the agent’s reward.

Different type of DP

Unlike traditional algorithms and data structures problems, this is a different type of dynamic programming. There, the final answer is usually the value of one cell, and the other cells are used for memoization. Here, the final answer is all the cells: we want to know the value of every cell. Other unusual DP problems are listed at [4].

Policy Evaluation

policy_eval
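A minimal sketch of iterative policy evaluation on a hypothetical 3-state chain. The MDP, the `(probability, next_state, reward)` layout of `P`, and the function name are assumptions made for illustration, not taken from the notebook at [0].

```python
# Hypothetical chain: states 0 and 1 are ordinary, state 2 is terminal.
# P[s][a] is a list of (probability, next_state, reward); action 0 = left, 1 = right.
P = {
    0: {0: [(1.0, 0, -1.0)], 1: [(1.0, 1, -1.0)]},
    1: {0: [(1.0, 0, -1.0)], 1: [(1.0, 2, -1.0)]},
    2: {0: [(1.0, 2, 0.0)], 1: [(1.0, 2, 0.0)]},
}

def policy_evaluation(P, policy, gamma=1.0, theta=1e-10):
    """Sweep over the states until the value function stops changing."""
    V = [0.0] * len(P)
    while True:
        delta = 0.0
        for s in range(len(P)):
            v = 0.0
            # Average over actions, weighted by the policy's action probabilities.
            for a, action_prob in enumerate(policy[s]):
                for prob, next_s, reward in P[s][a]:
                    v += action_prob * prob * (reward + gamma * V[next_s])
            delta = max(delta, abs(v - V[s]))
            V[s] = v
        if delta < theta:
            return V

# Uniform random policy: left and right with probability 0.5 in every state.
random_policy = [[0.5, 0.5] for _ in range(len(P))]
V = policy_evaluation(P, random_policy)
```

Under the random policy the values come out at about -6 for state 0 and -4 for state 1, the expected number of steps (each costing -1) before reaching the terminal state.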

Policy Iteration

policy_iter
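A minimal sketch of policy iteration, alternating full evaluation with greedy improvement. The 3-state chain, the discount of 0.9, and the function names are hypothetical choices for illustration; a discount below 1 is used so that evaluating the initial "always left" policy stays finite.

```python
# Hypothetical chain: states 0 and 1 are ordinary, state 2 is terminal.
# P[s][a] is a list of (probability, next_state, reward); action 0 = left, 1 = right.
P = {
    0: {0: [(1.0, 0, -1.0)], 1: [(1.0, 1, -1.0)]},
    1: {0: [(1.0, 0, -1.0)], 1: [(1.0, 2, -1.0)]},
    2: {0: [(1.0, 2, 0.0)], 1: [(1.0, 2, 0.0)]},
}

def q_value(P, V, s, a, gamma):
    """Expected return of taking action a in state s, then following V."""
    return sum(p * (r + gamma * V[ns]) for p, ns, r in P[s][a])

def evaluate(P, policy, gamma, theta=1e-10):
    """Evaluate a deterministic policy given as one action index per state."""
    V = [0.0] * len(P)
    while True:
        delta = 0.0
        for s in range(len(P)):
            v = q_value(P, V, s, policy[s], gamma)
            delta = max(delta, abs(v - V[s]))
            V[s] = v
        if delta < theta:
            return V

def policy_iteration(P, gamma=0.9):
    policy = [0] * len(P)  # start with "always left"
    while True:
        V = evaluate(P, policy, gamma)          # step 1: evaluate the policy
        new_policy = [max(P[s], key=lambda a: q_value(P, V, s, a, gamma))
                      for s in range(len(P))]   # step 2: act greedily on V
        if new_policy == policy:                # stable policy: done
            return V, policy
        policy = new_policy

V, policy = policy_iteration(P)
```

On this chain the loop settles on moving right in states 0 and 1, which is the shortest path to the terminal state.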

Value Iteration

value_iter
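A minimal sketch of value iteration: the value of each state is repeatedly set to the best action value, and a policy is read off only at the end. The 3-state chain and function names are hypothetical, chosen for illustration.

```python
# Hypothetical chain: states 0 and 1 are ordinary, state 2 is terminal.
# P[s][a] is a list of (probability, next_state, reward); action 0 = left, 1 = right.
P = {
    0: {0: [(1.0, 0, -1.0)], 1: [(1.0, 1, -1.0)]},
    1: {0: [(1.0, 0, -1.0)], 1: [(1.0, 2, -1.0)]},
    2: {0: [(1.0, 2, 0.0)], 1: [(1.0, 2, 0.0)]},
}

def q_value(P, V, s, a, gamma):
    """Expected return of taking action a in state s, then following V."""
    return sum(p * (r + gamma * V[ns]) for p, ns, r in P[s][a])

def value_iteration(P, gamma=1.0, theta=1e-10):
    V = [0.0] * len(P)
    while True:
        delta = 0.0
        for s in range(len(P)):
            # Take the max over actions instead of averaging over a policy.
            best = max(q_value(P, V, s, a, gamma) for a in P[s])
            delta = max(delta, abs(best - V[s]))
            V[s] = best
        if delta < theta:
            break
    # Extract the greedy policy once, from the converged value function.
    policy = [max(P[s], key=lambda a: q_value(P, V, s, a, gamma))
              for s in range(len(P))]
    return V, policy

V, policy = value_iteration(P)
```

No explicit policy is maintained during the sweeps, which is the difference from policy iteration.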

Here are two slides from : http://www0.cs.ucl.ac.uk/staff/d.silver/web/Teaching_files/DP.pdf

References

[0] : https://github.com/arcarchit/datastories/blob/master/RL_GridWorld.ipynb

[1] : Sutton’s book

[2] : https://github.com/dennybritz/reinforcement-learning/tree/master/DP/

[3] : http://www0.cs.ucl.ac.uk/staff/d.silver/web/Teaching_files/DP.pdf

[4] : https://github.com/arcarchit/mit-ds-algo/blob/master/dsalgo/cs/dp/dp_equations.md

OpenAI Gym Environment

Posted October 19, 2017 (updated May 7, 2020) by Archit Vora

OpenAI provides a framework for creating environments and training on them. In this post I am pasting a simple notebook for a quick lookup of how to use these environments and which functions are available on the environment object.

I have used an environment available on GitHub by Denny Britz.

References :

https://github.com/dennybritz/reinforcement-learning

Learning Reinforcement Learning (with Code, Exercises and Solutions)

https://gym.openai.com/docs/

My Code : https://gist.github.com/arcarchit/2b3363e2615df7ef5c8d4941d4dfa9e8

Notebook: gym_env.ipynb (hosted on GitHub)

[Example] Lagrange Multiplier With Equality Constraints

Posted September 10, 2017 (updated December 13, 2017) by Archit Vora

Stationary Point

Definition of stationary point from wikipedia :

In mathematics, particularly in calculus, a stationary point or critical point of a differentiable function of one variable is a point on the graph of the function where the function’s derivative is zero. Informally, it is a point where the function “stops” increasing or decreasing (hence the name).

The Lagrange multiplier method helps us find all the stationary points. Each can be a local minimum, local maximum, global minimum, or global maximum. Once we evaluate the objective function at each of these stationary points, we can classify which ones are local/global minima and maxima.
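As a small illustration of that procedure, consider a hypothetical problem (chosen here for simplicity): find the extrema of $f(x, y) = x + y$ subject to $x^2 + y^2 = 2$.

```latex
\begin{aligned}
\mathcal{L}(x, y, \lambda) &= x + y - \lambda\,(x^2 + y^2 - 2) \\
\frac{\partial \mathcal{L}}{\partial x} &= 1 - 2\lambda x = 0
  \quad\Rightarrow\quad x = \frac{1}{2\lambda} \\
\frac{\partial \mathcal{L}}{\partial y} &= 1 - 2\lambda y = 0
  \quad\Rightarrow\quad y = \frac{1}{2\lambda} \\
x^2 + y^2 = 2 &\quad\Rightarrow\quad 2x^2 = 2
  \quad\Rightarrow\quad x = y = \pm 1, \quad \lambda = \pm\tfrac{1}{2}
\end{aligned}
```

The two stationary points are $(1, 1)$ and $(-1, -1)$. Evaluating the objective gives $f(1, 1) = 2$ and $f(-1, -1) = -2$, so the first is the constrained maximum and the second the constrained minimum.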

Example

Thompson Sampling

Posted September 2, 2017 (updated January 1, 2021) by Archit Vora

Thompson sampling is one approach to the multi-armed bandit problem and to the exploration-exploitation dilemma faced in reinforcement learning. It is also known as posterior sampling.

The challenge in solving such a problem is that we might end up pulling the same arm again and again. The Bayesian approach helps us resolve this dilemma by starting from a prior with fairly high variance.

Here is the code for a two-armed bandit. One arm has a success probability of 40% (bandit 0) and the other 25% (bandit 1).

We use the beta distribution to decide which arm to pull. The beta distribution has two parameters, alpha and beta. Higher values of alpha pull the distribution towards 1, and higher values of beta pull it towards 0. The beta distribution is always confined between 0 and 1.
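The effect of the two parameters can be seen from the closed-form mean and variance of the beta distribution. The snippet below is a small illustration with hypothetical parameter values.

```python
def beta_mean(a, b):
    """Mean of Beta(a, b)."""
    return a / (a + b)

def beta_variance(a, b):
    """Variance of Beta(a, b)."""
    return a * b / ((a + b) ** 2 * (a + b + 1))

# Beta(1, 1) is the uniform distribution on [0, 1]: centered, with high variance.
prior_mean, prior_var = beta_mean(1, 1), beta_variance(1, 1)

# More successes (larger alpha) pull the mean towards 1.
mean_after_successes = beta_mean(9, 1)

# Many observations shrink the variance, so the samples become less exploratory.
var_after_many_pulls = beta_variance(40, 60)
```

A wide prior such as Beta(1, 1) is what keeps both arms being tried early on.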

Training works as follows: for each feedback we receive, we increment alpha by 1 if it was a success, or beta by 1 if it was a failure. To choose an arm, we draw a random sample from the distribution of each arm and select the arm with the highest value.

	import scipy.stats as stats
	from matplotlib import pyplot as plt


	class BetaThompson:
	    """Thompson sampling with one Beta(a, b) posterior per arm."""

	    def __init__(self, num_bandits, prior_a, prior_b):
	        self.num_bandits = num_bandits
	        self.a = list(prior_a)  # alpha per arm (prior + successes)
	        self.b = list(prior_b)  # beta per arm (prior + failures)

	    def learn(self, bandit, success):
	        # Update the posterior of the arm that was pulled.
	        if success:
	            self.a[bandit] += 1
	        else:
	            self.b[bandit] += 1

	    def suggest(self):
	        # Draw one sample per arm and pick the arm with the highest sample.
	        sampled_prob = []
	        for i in range(self.num_bandits):
	            dist = stats.beta(self.a[i], self.b[i])
	            sampled_prob.append(dist.rvs())
	        return sampled_prob.index(max(sampled_prob))


	class Gamble:
	    """Simulated bandits; arm i succeeds with probability binom_mean[i]."""

	    def __init__(self, num_bandits, binom_mean):
	        self.num_bandits = num_bandits
	        self.binom_mean = binom_mean

	    def gamble(self, bandit_no):
	        # A single Bernoulli trial: binomial with n=1.
	        dist = stats.binom(n=1, p=self.binom_mean[bandit_no])
	        return bool(dist.rvs())


	def main():
	    g = Gamble(2, [0.40, 0.25])
	    bot = BetaThompson(2, [1, 1], [1, 1])  # uniform Beta(1, 1) priors
	    choice = []
	    for i in range(5000):
	        suggestion = bot.suggest()
	        result = g.gamble(suggestion)
	        bot.learn(suggestion, result)
	        choice.append(suggestion)

	    # Plot which arm was chosen at each draw.
	    plt.figure(figsize=(12, 5))
	    plt.plot(choice, 'b.')
	    plt.ylim([-0.5, 1.5])
	    plt.title('Choice made (Bandit 0-0.40 prob and bandit 1-0.25 prob)')
	    plt.xlabel('Draw #')
	    plt.ylabel('Bandit (0 or 1)')
	    plt.savefig('choice.png')


	if __name__ == "__main__":
	    main()


And here are the simulation results. We see that initially both arms are pulled frequently, but gradually arm 1 is pulled less and less, though its pull rate never drops straight to zero.

Math Behind Increasing alpha and beta

In one line: the posterior is a beta distribution for a beta prior and a Bernoulli likelihood. In other words, beta is a conjugate prior for the Bernoulli likelihood. A list of various conjugate priors is available at [1].

The PDF of the beta distribution is simple once you think in terms of the effect of alpha and beta.
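The conjugacy can be written out in a few lines. This is the standard derivation, with $\theta$ the success probability of an arm and $x \in \{0, 1\}$ a single observed outcome:

```latex
\begin{aligned}
\text{prior:} \quad & p(\theta) \propto \theta^{\alpha - 1} (1 - \theta)^{\beta - 1} \\
\text{likelihood:} \quad & p(x \mid \theta) = \theta^{x} (1 - \theta)^{1 - x} \\
\text{posterior:} \quad & p(\theta \mid x) \propto \theta^{(\alpha + x) - 1} (1 - \theta)^{(\beta + 1 - x) - 1}
  \;=\; \mathrm{Beta}(\alpha + x,\; \beta + 1 - x)
\end{aligned}
```

So a success ($x = 1$) adds 1 to alpha and a failure ($x = 0$) adds 1 to beta, which is exactly the update rule used in the code.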

Look at note below, formula of beta distribution is not complex, actually it is very similar to

References

[1] : https://ocw.mit.edu/courses/mathematics/18-05-introduction-to-probability-and-statistics-spring-2014/readings/MIT18_05S14_Reading15a.pdf

Data Stories

Author Archit Vora
