Central limit theorem

What does the CLT say?

  • The sum of independent random samples is approximately normally distributed once the sample size is large enough
    • The samples themselves need not come from a normal distribution
  • Since the mean is just the sum divided by n, the sample mean is also approximately normally distributed

Straight facts

  • The central limit theorem helps in getting confidence intervals for parameters
  • As a rule of thumb it works for any distribution with finite variance when n > 30
  • For a normal distribution it works even if n < 30
  • Why do we need to know the distribution?
    • To make variance estimation stable
    • We want to have just one unknown, the mean
    • For small samples we need to test the normality of the samples before applying a t-test

Slide from MIT course : https://ocw.mit.edu/courses/mathematics/18-650-statistics-for-applications-fall-2016/

sigma / sqrt(n) is the standard error of the mean. The CLT says that (sample mean - mu) / (sigma / sqrt(n)) converges in distribution to the standard normal distribution N(0, 1).
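
The statement above can be checked with a small simulation. This is only a sketch under my own assumptions: the source distribution is Exponential(1) (clearly not normal, with mean 1 and standard deviation 1), and the sample size of 50 and the 5000 trials are arbitrary choices.

```python
import math
import random
import statistics

def standardized_means(n, trials, seed=0):
    """Standardize the mean of `trials` samples of size n drawn from Exponential(1)."""
    rng = random.Random(seed)
    mu, sigma = 1.0, 1.0                # mean and SD of Exponential(rate=1)
    std_error = sigma / math.sqrt(n)    # standard error of the mean
    result = []
    for _ in range(trials):
        sample_mean = sum(rng.expovariate(1.0) for _ in range(n)) / n
        result.append((sample_mean - mu) / std_error)
    return result

z = standardized_means(n=50, trials=5000)
center = statistics.fmean(z)                          # should be near 0
spread = statistics.pstdev(z)                         # should be near 1
share_within_2 = sum(abs(v) < 2 for v in z) / len(z)  # roughly 95 % for N(0, 1)
print(round(center, 2), round(spread, 2), round(share_within_2, 2))
```

If the CLT holds, the standardized means should be centred near 0 with spread near 1, and roughly 95 % of them should fall within 2 of zero.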

Law of Large Numbers

  • As a sample size grows, its mean gets closer to the average of the whole population. This is due to the sample being more representative of the population

Example:

  • During significance testing we calculate the left-hand side of the CLT statement, i.e. the standardized sample mean. For example, when testing the fairness of a coin that number comes out to be 3.54. For the standard normal, 3*sigma = 3*1 = 3 covers about 99.7 % of the area. We are further away than that, so we can reject the null hypothesis. [1]
  • The thing to understand is that the estimate of the Bernoulli parameter p (the sample mean) is approximately normally distributed.
  • We are not measuring how far the observed mean is from 0.5 on the scale of the Bernoulli distribution itself. If we were doing that we would not have used sqrt(n).
    • More importantly, a Bernoulli variable can take only two values, 0 and 1, so from that perspective as well it would not make sense.
    • See the equation in the slide in the central limit theorem section. The limit is a normal distribution N(0,1).
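
A minimal sketch of the coin test follows. The counts (125 heads in 200 tosses) are hypothetical; I picked them only because they give a statistic close to the 3.54 mentioned above, and they are not taken from the slides. The function name `z_statistic` is also my own.

```python
import math

def z_statistic(heads, n, p0=0.5):
    """Standardized sample mean under the null hypothesis p = p0."""
    p_hat = heads / n
    return math.sqrt(n) * (p_hat - p0) / math.sqrt(p0 * (1 - p0))

z = z_statistic(heads=125, n=200)   # hypothetical counts
reject = abs(z) > 3                 # 3 sigma rule of thumb for N(0, 1)
print(round(z, 2), reject)
```

Note that the statistic is compared against N(0, 1), not against the Bernoulli distribution of a single toss.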

References

[0] : Slide from MIT course : https://ocw.mit.edu/courses/mathematics/18-650-statistics-for-applications-fall-2016/

[1] : https://ocw.mit.edu/courses/18-650-statistics-for-applications-fall-2016/resources/mit18_650f16_parametric_ht/

Types of Statistical Studies

We study to learn something new. The word “study” in statistics implies conducting an experiment and analyzing data to learn something new, investigate something, or draw confident conclusions.

Studies are prevalent in medical fields, where people study various types of drugs on different demographics, geographies, and health conditions. This blog essentially contains my notes from the Coursera course: “Clinical Research” (https://www.coursera.org/learn/clinical-research/home/welcome).

Types of Studies:

  1. Observational Studies:
    • Case Series Study: Observes and describes subjects without requiring a research hypothesis. Numbers derived from such studies help remove inherent biases. These studies often serve as initial steps for complex studies.
    • Case Control Studies: Compare two or more groups based on the presence or absence of a disease. These studies look at historical data to identify variables that differ between the groups. Confounding, such as smoking in a study on alcohol and heart attacks, needs to be controlled for.
      • When you present an analysis, experienced seniors will often ask: what was the value of this feature in both groups?
    • Cross-Sectional Studies: Conducted in the form of surveys, gathering data at a specific time. For example, a survey sent to optometrists and ophthalmologists to understand their dietary advice to patients. These studies aim to examine current practices and identify areas for improvement. (Reference: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3695797/)
      • Drawbacks
        • Response bias
          • Suppose you are asking questions related to HIV. Positive patients are less likely to answer than negative ones.
          • In a government survey, people in cities are more likely to answer than people in villages.
        • Almost impossible to infer causality
          • Since the survey takes place at a particular point in time, we cannot determine whether the disease outcome followed the exposure or the exposure followed the disease.
    • Cohort Studies: Identify a group of subjects (cohort) and follow them either backward in history (retrospective cohort) or forward in the future (prospective cohort). Computerized data collection has made retrospective cohorts possible. These studies are observational and do not involve controlling variables.
  2. Experimental Studies (Interventional):
    • In these studies, interventions are implemented to reduce bias inherent in observational studies. There is a control group that receives no intervention (sham/placebo).
    • Some Key Terms:
      • Randomization: Every member of the population should have an equal opportunity to be part of the study, and participants should have an equal chance of being assigned to any group.
      • Blinding: Participants are unaware of their assigned groups. If researchers are also unaware of the groups, it is called a double-blind study. Achieving double-blindness is challenging in surgical operations.

Reservoir sampling

Where it is used

  • Suppose you have streaming data and you want to randomly sample from it. You don’t know how many items will be coming in.
  • You have a large set of items to sample from and you want to do it in a single pass

How it is done

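  • Suppose you want to sample 1000 items. [1]
  • You take the first 1000 items and put them into the reservoir
  • Next you take the 1001st item with probability 1000/1001
    • You draw a uniform random number in [0, 1) and if it is less than 1000/1001, you add this item to the reservoir
    • Remember the CDF trick: a uniform draw falls below p with probability exactly p
  • When you add this item, you remove one randomly chosen item from the reservoir to make room
  • In general the i-th item is taken with probability 1000/i, so every item ends up in the reservoir with equal probability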
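
The steps above can be sketched as follows. The function name `reservoir_sample` and the `rng` parameter are my own choices for illustration, not something from [1].

```python
import random

def reservoir_sample(stream, k, rng=None):
    """Return k items chosen uniformly at random from a stream of unknown length."""
    rng = rng or random.Random()
    reservoir = []
    for i, item in enumerate(stream, start=1):
        if i <= k:
            # The first k items fill the reservoir directly.
            reservoir.append(item)
        elif rng.random() < k / i:
            # Keep the i-th item with probability k/i and
            # evict one uniformly chosen resident to make room.
            reservoir[rng.randrange(k)] = item
    return reservoir

sample = reservoir_sample(iter(range(100_000)), k=1000, rng=random.Random(42))
print(len(sample))
```

Only the reservoir is kept in memory, so the stream is read once and its length never needs to be known.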

Alternative

  • Pick items from the stream, generate a random number for each as its key and put it in a priority queue that keeps only the 1000 items with the largest keys
  • This is how order by rand() in SQL works [1]
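
A sketch of this alternative using a min-heap as the priority queue. The name `sample_by_random_key` is hypothetical, and the arrival index in each heap entry is my addition so that ties on the key never compare the items themselves.

```python
import heapq
import random

def sample_by_random_key(stream, k, rng=None):
    """Tag every item with a random key and keep the k items with the largest keys."""
    rng = rng or random.Random()
    heap = []  # min-heap of (key, arrival index, item); root holds the smallest kept key
    for index, item in enumerate(stream):
        entry = (rng.random(), index, item)
        if len(heap) < k:
            heapq.heappush(heap, entry)
        elif entry[0] > heap[0][0]:
            # New key beats the smallest kept key: swap it in.
            heapq.heapreplace(heap, entry)
    return [item for _, _, item in heap]

picked = sample_by_random_key(range(10_000), k=50, rng=random.Random(7))
print(len(picked))
```

Each item costs at most O(log k) here, against O(1) for the reservoir update, but the random keys make it straightforward to merge samples taken on separate machines.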

References

[1] https://gregable.com/2007/10/reservoir-sampling.html

[2] https://www.youtube.com/watch?v=Ybra0uGEkpM (proof)

log-linear and log-log regression

In linear regression, taking the log is a popular way to make a relationship linear. We can take the log of the response, the predictor, or both. Together with the plain linear model this gives us four classes [0]

log_llog

Interpretations of β

  • Linear Model
    • the coefficient β gives us directly the change in Y for a one-unit change in X
  • Linear Log Model
    • β is the expected change in Y when X is multiplied by e (natural log)
  • Log Linear Model
    • Each 1-unit increase in X multiplies the expected value of Y by e^β
  • Log Log Model
    • Multiplying X by e multiplies the expected value of Y by e^β
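
The four interpretations can be verified numerically. This is a sketch with made-up coefficients (`alpha = 1.0`, `beta = 0.3`) and noise-free models, so it only checks the algebra, not a fitted regression.

```python
import math

alpha, beta = 1.0, 0.3  # hypothetical coefficients

def linear(x):
    return alpha + beta * x                      # Y = a + b X

def linear_log(x):
    return alpha + beta * math.log(x)            # Y = a + b ln X

def log_linear(x):
    return math.exp(alpha + beta * x)            # ln Y = a + b X

def log_log(x):
    return math.exp(alpha + beta * math.log(x))  # ln Y = a + b ln X

x = 2.0
linear_change = linear(x + 1) - linear(x)                    # equals beta
linear_log_change = linear_log(math.e * x) - linear_log(x)   # equals beta
log_linear_ratio = log_linear(x + 1) / log_linear(x)         # equals e ** beta
log_log_ratio = log_log(math.e * x) / log_log(x)             # equals e ** beta
print(linear_change, linear_log_change, log_linear_ratio, log_log_ratio)
```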

I have coded a notebook to see the curves for all four at [1].

A note on Normalisation

  • Suppose you need to normalise data to bring it between 0 and 1. This will be fed into some linear function (say a ranking function without a supervised response y). If your data is exponentially distributed, instead of dividing by max you can try log(sample)/log(max). This is like a feature transformation by taking the log and then normalising it.
    • This applies to normalising a variable which seems to have an exponential distribution
  • Also, when you check the correlation with the response variable and you see plots like those in [1], you know which transformation to take.
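
A small sketch comparing the two normalisations. The sample values are made up, and I assume every value is at least 1 and the maximum is above 1, so that log(sample)/log(max) stays between 0 and 1.

```python
import math

def max_normalise(values):
    """Divide by the maximum; small values get squashed towards 0."""
    top = max(values)
    return [v / top for v in values]

def log_normalise(values):
    """log(sample) / log(max); assumes all values >= 1 and max > 1."""
    log_top = math.log(max(values))
    return [math.log(v) / log_top for v in values]

values = [1, 10, 100, 1000, 10000]   # hypothetical, spans several orders of magnitude
by_max = max_normalise(values)
by_log = log_normalise(values)
print(by_max)
print(by_log)
```

Dividing by max leaves most of the values near 0, while the log version spreads them evenly over the range.
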
References :

[0] https://kenbenoit.net/assets/courses/ME104/logmodels2.pdf
[1] https://github.com/arcarchit/datastories/blob/master/notebooks/log_linear_models.ipynb

From highscalability.com

Highscalability.com has a plethora of case study blogs that I have learned from. Here are some key takeaways:

Deep Learning in Production

  • Batching should be performed at the latest possible stage in the processing chain, specifically for inferencing on GPUs. However, maintaining a certain response time Service Level Agreement (SLA) is essential. Batching here is not scheduled; we batch whenever an opportunity arises.
  • Data pipelines in Hive are different in this respect, as they are scheduled to run daily.
  • Gatekeeping: Limiting the number of requests to 10 at a time.
  • Suppose the inference time of GPT is 50 ms (99th percentile). Once a request is accepted, we guarantee a response time of 5 seconds. If it is not accepted, we send HTTP status code 429, indicating “too many requests.” If excessive 429 responses are observed, we can consider spawning new machines.
  • Unsolved problems:
    • Loading input and pre/post-processing tasks consume CPU, while the expensive GPU remains idle during this time.
    • In a pub/sub model, the message injection rate should match the consumption rate.
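
The gatekeeping idea can be sketched with a non-blocking semaphore. The class name `Gatekeeper`, the `handle` method and the `infer` callback are hypothetical names of mine and not taken from the case study; a real service would do this in its web framework.

```python
import threading

class Gatekeeper:
    """Admit at most `max_in_flight` requests at a time; reject the rest with 429."""

    def __init__(self, max_in_flight=10):
        self._slots = threading.BoundedSemaphore(max_in_flight)

    def handle(self, request, infer):
        # Try to take a slot without waiting; a full gate means "too many requests".
        if not self._slots.acquire(blocking=False):
            return 429, None
        try:
            return 200, infer(request)
        finally:
            self._slots.release()

gate = Gatekeeper(max_in_flight=10)
status, result = gate.handle("some input", lambda req: req.upper())
print(status, result)
```

Counting the 429 responses over time gives the signal for when to add machines.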

Uber

  • They shared “What I Wish I Knew (WIWIK),” which primarily focuses on their experience with microservices and the potential downsides.
  • During a talk, the last question was about handling the decoupling of microservices. The answer was that they strive to do their best with engineering practices, but sometimes decoupling challenges still occur.

KL Divergence

We need to understand these four terms from information theory:

  1. Self Information
  2. Entropy
  3. KL Divergence
  4. Cross Entropy

Self Information

  • Information conveyed by an event
  • The higher the probability, the lower the information conveyed: I(x) = - log P(x)

Entropy

  • Self information is of an individual event, entropy is of a distribution
  • Entropy is the expected information of an event drawn from that distribution
  • Distributions that are closer to uniform have higher entropy
    • Distributions that are nearly deterministic (where the outcome is almost certain) have lower entropy
  • If the log base is 2, it represents the number of bits needed on average to encode symbols drawn from the distribution P. However, this intuition is used more prominently in communication theory than in machine learning.

H(P) = - sum_x P(x) * log P(x)

KL Divergence

  • If we have two distributions P and Q over the same random variable x, it tells how different these two distributions are

KL(P||Q) = sum_x P(x) * log( P(x) / Q(x) )

  • Extra amount of information (bits in base 2) needed to send a message containing symbols from P, while the encoding was designed for Q
  • KL divergence is never negative, and it is zero only when P and Q are the same
    • It can be greater than 1
    • The extra bits required to encode information can be greater than 1
  • Example at [1]:
    • We have observed a few events and we have the observed distribution
    • We want to represent it by some standard distribution, say uniform or binomial.
    • Which one should we choose?
      • The one for which KL divergence is minimum
      • This is the same as saying the one for which the extra information is minimum.
    • The binomial distribution has a parameter p (probability of event = 1). Which value of this parameter should we choose?
      • The one for which KL divergence is minimum.
      • It thus becomes an optimization problem
  • KL divergence is sometimes termed the distance between two distributions
    • But it is not symmetric: KL(P||Q) is not the same as KL(Q||P)
  • Application :
    • One use case is in variational autoencoders (VAE) [2]
      • A VAE generates sample data like a GAN
      • For that we want the output of the encoder to be more generic before giving it to the decoder
        • We measure this by the KL divergence between the encoder output and a standard normal distribution
        • We add it to the final loss function
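
A minimal sketch of the points above: KL divergence in bits, its asymmetry, and choosing the binomial parameter by minimising KL. The `observed` distribution is made up for illustration (it is not the data from [1]), and the grid search over p is just the simplest optimiser I could use here.

```python
import math

def kl_divergence(p, q):
    """KL(P || Q) in bits; outcomes with P(x) = 0 contribute nothing."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def binomial_pmf(n, prob):
    """Probabilities of 0..n successes in n trials."""
    return [math.comb(n, k) * prob**k * (1 - prob)**(n - k) for k in range(n + 1)]

observed = [0.1, 0.4, 0.3, 0.2]   # hypothetical distribution over the counts 0..3
uniform = [0.25] * 4

kl_pq = kl_divergence(observed, uniform)
kl_qp = kl_divergence(uniform, observed)   # differs from kl_pq: not symmetric

# Pick the binomial parameter that minimises KL(observed || Binomial(3, p)).
candidates = [i / 100 for i in range(1, 100)]
best_p = min(candidates, key=lambda pr: kl_divergence(observed, binomial_pmf(3, pr)))
print(round(kl_pq, 4), round(kl_qp, 4), best_p)
```

The best p lands next to (mean of the observed counts) / 3, which is what fitting the binomial by maximum likelihood would give.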

Cross Entropy

  • KL divergence measures the extra information (bits) needed to encode P with symbols optimised for Q
  • Cross entropy measures the total information needed to encode P with symbols optimised for Q
  • The formula for log-loss is exactly the same (it is also called cross entropy loss)

H(P, Q) = H(P) + KL(P||Q)

H(P, Q) = - sum_x P(x) * log Q(x)
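
A quick numeric check that cross entropy equals entropy plus KL divergence, and that a uniform distribution has higher entropy than a nearly deterministic one. The distributions `p`, `q` and `nearly_certain` are made-up examples.

```python
import math

def entropy(p):
    """H(P) in bits."""
    return -sum(pi * math.log2(pi) for pi in p if pi > 0)

def cross_entropy(p, q):
    """H(P, Q) in bits: cost of encoding P with a code built for Q."""
    return -sum(pi * math.log2(qi) for pi, qi in zip(p, q) if pi > 0)

def kl_divergence(p, q):
    """KL(P || Q) in bits."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

p = [0.5, 0.25, 0.25]
q = [0.25, 0.25, 0.5]
total_bits = cross_entropy(p, q)
base_bits = entropy(p)
extra_bits = kl_divergence(p, q)
print(total_bits, base_bits, extra_bits)

uniform_entropy = entropy([0.25] * 4)
nearly_certain = entropy([0.97, 0.01, 0.01, 0.01])
print(uniform_entropy, round(nearly_certain, 3))
```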

Related

  • KS test is used for goodness of fit
    • It is formulated in terms of a hypothesis test and gives p-values
    • Based on the empirical cumulative distribution function (empirical CDF)
  • KL divergence is differentiable
    • Popular in machine learning loss functions
    • Based on information theory

References:

[0] : Deep Learning by Ian Goodfellow, http://www.deeplearningbook.org/

[1] : https://www.countbayesie.com/blog/2017/5/9/kullback-leibler-divergence-explained

[2]: https://towardsdatascience.com/intuitively-understanding-variational-autoencoders-1bfe67eb5daf

IR metrics

Information retrieval (IR) deals with fetching relevant documents given a search query. The purpose of this blog is to list evaluation metrics that can be used to measure the performance of such a system.

Metrics for unranked retrieval

For all of these the definition is the same as for classification. However, they have some distinct characteristics for an IR system.

Precision and Recall

  • Example of a system where precision is important
    • Web search
  • Example of a system where recall is important
    • Individual searching their hard disk

Accuracy

For an IR system the data is generally very skewed: 99 % of the documents are in the non-relevant category. Hence accuracy does not make sense.

F Measure

  • We can have F1 score, F2 score etc. depending upon how much weight we want to give to precision and recall in the harmonic mean.
  • When β > 1, we give more weight to recall; when β < 1, we give more weight to precision.
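
A short sketch of the weighted harmonic mean. The precision and recall values are made up, and `f_beta` is my own helper name.

```python
def f_beta(precision, recall, beta=1.0):
    """Weighted harmonic mean of precision and recall; beta > 1 favours recall."""
    if precision == 0 and recall == 0:
        return 0.0
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)

precision, recall = 0.5, 0.8   # hypothetical system with better recall than precision
f1 = f_beta(precision, recall, beta=1.0)
f2 = f_beta(precision, recall, beta=2.0)
f_half = f_beta(precision, recall, beta=0.5)
print(round(f_half, 3), round(f1, 3), round(f2, 3))
```

Because recall is the stronger of the two here, the score rises as β grows.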

Metrics for ranked retrieval

Precision recall curves

  • In classification we traditionally plot the ROC curve by changing the threshold of the binary classifier. In IR we plot the precision-recall curve by changing the number of documents retrieved.

11 point interpolated average precision

  • For the recall values (0, 0.1, 0.2, …, 0.9, 1.0), find the interpolated precision at each level and average the eleven values.

Mean average precision (MAP)

  • For a given query we calculate average precision. We take the mean of that over several queries.
  • Example:
    • There are 10 documents; documents 1, 2 and 5 are relevant.
    • System 1 retrieves : 1,5,4,6,7,2
      • Average precision = (1/1 + 2/2 + 3/6)/3 = 5/6 = 0.83
    • System 2 retrieves : 1,2,5,3,4
      • Average precision = (1/1 + 2/2 + 3/3)/3 = 1
    • System 3 retrieves : 6,7,1,2,3,4,5
      • Average precision = (1/3 + 2/4 + 3/7)/3 = 0.42
  • Average precision typically varies a lot across different queries in the same system, say between 0.1 and 0.7.
  • For different systems and the same query, average precision does not vary that much. Hence, for testing which system is better using MAP, a large number of queries is needed.
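
Average precision for the three systems in the example can be computed directly. The helper name `average_precision` is mine. At each rank that holds a relevant document we take (relevant documents seen so far) / rank, so System 1 gives (1/1 + 2/2 + 3/6)/3.

```python
def average_precision(retrieved, relevant):
    """Mean of precision@rank taken at every rank where a relevant document appears."""
    relevant = set(relevant)
    hits = 0
    precision_sum = 0.0
    for rank, doc in enumerate(retrieved, start=1):
        if doc in relevant:
            hits += 1
            precision_sum += hits / rank
    # Relevant documents that were never retrieved contribute 0.
    return precision_sum / len(relevant) if relevant else 0.0

relevant_docs = [1, 2, 5]
ap_system_1 = average_precision([1, 5, 4, 6, 7, 2], relevant_docs)
ap_system_2 = average_precision([1, 2, 5, 3, 4], relevant_docs)
ap_system_3 = average_precision([6, 7, 1, 2, 3, 4, 5], relevant_docs)
print(round(ap_system_1, 2), round(ap_system_2, 2), round(ap_system_3, 2))
```

MAP would then be the mean of such values over many queries for one system.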

Precision at k

  • MAP measures precision at various recall levels (until all the relevant documents are retrieved). For applications like web search, what matters is the result on the first page or the first three pages, so we measure precision over only the top k results.
  • Disadvantages
    • It is the least stable of these metrics and does not average well
    • Reason : the total number of relevant documents for a query has a strong influence on this metric.

R precision

  • Same as precision at k where k = number of relevant documents for a given query
  • It adjusts for the number of relevant documents for a query (the disadvantage of precision at k)
  • Empirically, R-precision and MAP turn out to be highly correlated.

NDCG

  • Normalized Discounted Cumulative Gain
  • Like precision at k it is evaluated at some value of k
  • It consists of the cumulative gain at each position, which is discounted by position and then normalized
    • It is not max or sum normalized
    • It is normalized by the ideal DCG, which is calculated by sorting the documents by relevance score. [3]
  • NDCG ranges between 0 and 1. For a perfect ranking the value of NDCG is 1.
  • Example
    • positions : 1 2 3 4 5
    • eval score : 2 1 2 0 1

NDCG(q, k) = DCG(k) / Ideal DCG(k)

DCG(k) = sum (Gain(i) * Discount factor(i)) for i in (1,k)

The discount factor is generally taken as 1/log(1 + pos).

The gain is generally taken as (2^r - 1), where r is the relevance score given by humans, say 0 - not relevant, 1 - near relevant, or 2 - relevant.
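
The example can be worked through in code. This is a sketch under two assumptions of mine: the log is base 2 (the base cancels in the ratio, so NDCG itself does not depend on it), and the ideal ranking is built only from the five judged documents shown in the example.

```python
import math

def dcg(relevances, k):
    """Discounted cumulative gain: gain (2^r - 1) times discount 1 / log2(1 + pos)."""
    return sum((2 ** r - 1) / math.log2(1 + pos)
               for pos, r in enumerate(relevances[:k], start=1))

def ndcg(relevances, k):
    """DCG divided by the DCG of the same documents sorted by relevance."""
    ideal = dcg(sorted(relevances, reverse=True), k)
    return dcg(relevances, k) / ideal if ideal > 0 else 0.0

eval_scores = [2, 1, 2, 0, 1]          # relevance at positions 1..5 from the example
score_at_5 = ndcg(eval_scores, k=5)
perfect = ndcg(sorted(eval_scores, reverse=True), k=5)
print(round(score_at_5, 4), perfect)
```

The ranking in the example is close to ideal because only the documents at positions 2 and 3 are out of order.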

References

[0] Christopher D. Manning, Prabhakar Raghavan and Hinrich Schütze, Introduction to Information Retrieval, Cambridge University Press. 2008.

[1] https://towardsdatascience.com/breaking-down-mean-average-precision-map-ae462f623a52

[2] https://blog.thedigitalgroup.com/measuring-search-relevance-using-ndcg

[3] https://www.geeksforgeeks.org/normalized-discounted-cumulative-gain-multilabel-ranking-metrics-ml/

Geometry of Linear Equations – Column Picture

Notes from Prof. Gilbert Strang’s Lecture on MIT OpenCourseWare: The Geometry of Linear Equations

In linear algebra, when faced with equations, we often try to visualize them using the row picture.

Row Picture: In 2-D, we can think of it as a line and aim to find its intersection.

Column Picture: We aim to find the weights of a linear combination of columns. In the image on the right, we see the addition of two vectors. We start from the origin and add them at the tail.

mit_row_2d
mit_col_2d

As we move to higher dimensions, the row picture becomes more challenging to visualize, while the column picture remains straightforward.

mit_3d

Furthermore, the column picture allows us to check if the combination of all columns fills the entire space. We can verify this through elimination.

mit_subspace

Additionally, we can develop a habit of viewing matrix multiplication as a linear combination of columns.

mit_mm
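
To make the habit concrete, here is a small sketch that computes A x both ways. The 2x2 system (2x - y = 0, -x + 2y = 3 with solution x = 1, y = 2) is used only as an illustration, and the helper names are mine.

```python
def combine_columns(columns, weights):
    """Column picture: A x is a weighted sum of the columns of A."""
    size = len(columns[0])
    return [sum(w * col[i] for col, w in zip(columns, weights)) for i in range(size)]

def row_picture_product(rows, x):
    """Row picture: each entry of A x is the dot product of one row with x."""
    return [sum(a * xi for a, xi in zip(row, x)) for row in rows]

rows = [[2, -1],
        [-1, 2]]
columns = [list(col) for col in zip(*rows)]   # [[2, -1], [-1, 2]]
x = [1, 2]

from_columns = combine_columns(columns, x)    # 1 * [2, -1] + 2 * [-1, 2]
from_rows = row_picture_product(rows, x)
print(from_columns, from_rows)
```

Both pictures give the same right-hand side; the column picture just reads it as "1 of the first column plus 2 of the second".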

Correlation and Regression Slope

In a simple regression model, the regression slope (β) represents the estimated change in the dependent variable (Y) corresponding to a one-unit increase in the independent variable (X). It quantifies the linear relationship between X and Y and indicates the direction and magnitude of the relationship.

The correlation coefficient (r) measures the strength and direction of the linear relationship between X and Y. It ranges from -1 to +1, with a value of 1 indicating a perfect positive linear relationship, -1 indicating a perfect negative linear relationship, and 0 indicating no linear relationship.

When the standard deviations of both X and Y are equal (SD(X) = SD(Y)), the regression slope (β) and the correlation coefficient (r) coincide.

The slope can be calculated as the correlation coefficient multiplied by the ratio of the standard deviations (β = r * SD(Y) / SD(X)). The correlation coefficient essentially represents the slope you would obtain from a regression of standardized variables (Y / SD(Y) on X / SD(X) or vice versa).

However, when the standard deviations of X and Y are not equal, the regression slope and the correlation coefficient provide distinct information:

  1. The correlation coefficient is a bounded measure that can be interpreted independently of the scale of the variables. It indicates the strength of the linear relationship between X and Y, with values closer to ±1 indicating a stronger linear relationship. The regression slope, on its own, does not provide this information.
  2. The regression slope represents the estimated change in the expected value of Y for a given unit increase in X. It provides information about the direction and magnitude of the relationship between X and Y in the original units of measurement. This information cannot be deduced from the correlation coefficient alone.

One more thing to add here is the relationship between the correlation coefficient and covariance. The formula is: r = Covariance(X, Y) / [ SD(X) * SD(Y) ]. We are normalising by the SD of each variable, where SD = sqrt(variance). We can also say that β = Covariance(X, Y) / VAR(X).
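
These identities can be checked on a toy data set. The x and y values below are made up, and the helpers use the sample (n - 1) versions of covariance and SD.

```python
import math

def mean(values):
    return sum(values) / len(values)

def covariance(xs, ys):
    """Sample covariance (divides by n - 1)."""
    mx, my = mean(xs), mean(ys)
    return sum((a - mx) * (b - my) for a, b in zip(xs, ys)) / (len(xs) - 1)

def sd(values):
    return math.sqrt(covariance(values, values))

x = [1.0, 2.0, 3.0, 4.0, 5.0]    # hypothetical data
y = [2.1, 3.9, 6.2, 7.8, 10.1]

r = covariance(x, y) / (sd(x) * sd(y))
slope = covariance(x, y) / covariance(x, x)   # least squares slope of y on x
slope_from_r = r * sd(y) / sd(x)

# Regressing standardized y on standardized x gives a slope equal to r.
x_std = [v / sd(x) for v in x]
y_std = [v / sd(y) for v in y]
slope_standardized = covariance(x_std, y_std) / covariance(x_std, x_std)
print(round(r, 4), round(slope, 4), round(slope_standardized, 4))
```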

References

[0] : https://stats.stackexchange.com/questions/32464/how-does-the-correlation-coefficient-differ-from-regression-slope

[1] : https://www.quora.com/Is-there-a-relationship-between-the-correlation-coefficient-and-the-slope-of-a-linear-regression-line

Softmax and cross entropy Loss

Softmax:

  • To explain softmax, Andrew Ng uses the terms “hard-max” and “soft-max.”
  • Softmax calculates the output probabilities of various classes using the formula: y_pred_i = exp(z_i) / sum_over_j ( exp(z_j) ).
  • Softmax outputs the probability distribution of the classes.
  • In hardmax, we assign one class as 1 and the others as 0.

Cross Entropy:

  • Cross entropy is a loss function commonly used in classification tasks.
  • The loss is calculated using the formula: Loss = - sum [y_actual * log(y_pred)].
  • For example, if the actual class is [1, 0, 0, 0, 0]:
    • y_pred_1 = [0.1, 0.5, 0.1, 0.1, 0.2]
    • y_pred_2 = [0.1, 0.6, 0.1, 0.1, 0.1]
  • The loss will be the same for y_pred_1 and y_pred_2.
  • This is a key feature of multiclass log loss: it rewards or penalizes the probabilities of correct classes only, and the value is independent of how the remaining probability is split between incorrect classes. [0]
  • Cross entropy is the same as the loss function of logistic regression; logistic regression is just the case with two classes.
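
A minimal sketch of both formulas, using the example vectors above. The logits passed to `softmax` are made up, and subtracting the maximum logit is a common numerical-stability trick that I added; it does not change the result.

```python
import math

def softmax(z):
    """Turn raw scores into a probability distribution."""
    shift = max(z)                        # stability trick, cancels in the ratio
    exps = [math.exp(v - shift) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(y_actual, y_pred):
    """Loss = - sum(y_actual * log(y_pred)); only the true class contributes."""
    return -sum(a * math.log(p) for a, p in zip(y_actual, y_pred) if a > 0)

y_actual = [1, 0, 0, 0, 0]
y_pred_1 = [0.1, 0.5, 0.1, 0.1, 0.2]
y_pred_2 = [0.1, 0.6, 0.1, 0.1, 0.1]
loss_1 = cross_entropy(y_actual, y_pred_1)
loss_2 = cross_entropy(y_actual, y_pred_2)
print(round(loss_1, 4), round(loss_2, 4))

probabilities = softmax([2.0, 1.0, 0.1])   # hypothetical logits
print([round(p, 3) for p in probabilities])
```

Both losses equal -log(0.1), since only the probability given to the correct class enters the sum.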

References: