[Paper] Semantic Product Search

Paper is related to search systems for e-commerce.
- It tries to replace BM-25 inverted index based recall with query-item embeddings.
- It is written by team at Amazon
- Paper URL – https://arxiv.org/pdf/1907.00937.pdf
Shortcoming of BM-25
- lack of understanding of hypernyms, synonyms, antonyms
  - hypernym – general/broad word
  - Color is hypernym for red
  - Examples
    - Red dress and burgundy dress
    - sneakers and running shoes
- Morphological variants (woman vs women)
  - Stemming and lemmatization is typical solution
  - Requires numerous hand-crafted rules
    - Converts reading glasses to reading glass
- Sensitivity to spelling error
  - Typically solved by spell correction algorithms

Architecture

Sharing embedding layer in siamese network work well, capturing local word level matching even before training this network.
Average pooling
- Comparing to LSTM performance of average pooling reduced by just 0.5 %
- Evaluation metric were MAP and Recall@100
- It helped computationally
- Unlike web search query and items titles have smaller length
- Query also does not have stop words
Batch Normalisation
- Length of items are generally longer than query
- They had observed difference in magnitude after avg pooling of query and item
- I have seen people using FC layer instead to have separate parameters

Loss Function

They first tried with 2 part hinge loss
Analysing the overlap items they found that they were clicked but not purchased.
- They note that from matching standpoint these products are okay to show to customer
So they introduced third term and decided it’s threshold to be 0
EmpIrical values used episolon (+) = 0.9, episilon (-) = 0.2
My view on this
- For the products that has clicks but not orders, new loss tries to keep it below 0.55
- 2 part hinge loss was for purchased vs random items
- Some of those random items had got clicks
- For positive items when we get similarity above 0.9 it should contribute 0 to loss
- For negative items if we get similarity below 0.2 it should contribute 0 to loss
- For neutral items if we get similarity below 0.55 it should contribute 0 to loss
y_hat is cosine similarity
- Based on which loss is calculated
- Instead of passing cosine similarity to sigmoid they are passing it through hinge function.

Tokenization Methods

Word unigram
word n-gram
- To preserve information about word ordering
- They preferred word n-gram over LSTM or CNN
Character trigram
- Robust typos (iphone & iphonr)
- Compound words (amazontv & firetvstick)
- Item parts and sizes
  - Tires have different sizes
  - TVs have different model
Handling unseen words
- Common NLP techniques
  - Mask the input
  - use same embedding for all unseen words
- What they choose
  - Keep more embedding space for unseen words
  - Hash them to get the bucket
- Enough space is allocated for frequent words and they are not hashed
  - 10k or 100k of n-grams are kept
Combining tokenization
- One approach
  - Separate embedding for unigram, biagram and character trigrams
  - Compute. weighted sum of cosine similarity individually
- Their approach
  - Consider them as bag or words

Data

Time range
- Training data – 11 month
- Eval data – 1 month
Trick of aggregating count as weight
- Weight while computing the loss, simple multiplication ( I had done that in synonyms as well )
- 54 billion query product training pairs
- Reduced to 650 million
Ratio
- 1 purchased
- 6 impressed
- 7 random
- They plan for both matching and ranking
- Matching -> Differentiate between (purchased + impressed) from random
- Ranking -> Differentiate between purchased and impressed
Query Preprocessing
- Changes queries to list of token IDs
- array length = 99 % tile of query token length
- Smaller tokens are padded to right
Item preprocessing
- One approach
  - Embed every attribute independently and concat
- Their approach
  - Ordered Bag of words for title and attributes
  - Issue was variability in attributes and lack of structured data across items

Model Evaluations

Evaluation Metrics

Matching
- Recall@100 and MAP
  - Used for tuning model hyper parameter as well
- 20k queries and sub-corpus for those queries
  - Around 1M items
  - Ordered + Impressed + Random
Ranking
- NDCG and MRR
- Purchased + Impressed

Evaluation Results

Table 1
- L2 variant of hinge loss outperforms L1
  - They hypothesise that L2 is more robust to outliers
- 3 part hinge loss outperforms 2 part in matching but not in ranking
  - Introduction of 3rd part forced purchased and random to be more separated
Table 2
- LSTM, GRU vs avg Pooling for aggregation
  - Comparison is made for various losses
  - Average pooling is performing similar or slightly better
    - Definitely reduces computation time
  - Since query and product title are relatively short this works well
  - Recurrent methods are expressive but introduces specialisation between query and title
    - Some may argue it is a good thing
    - They argue that it might not capture local word level mapping
Table 3
- Compares different tokenization methods
- Vocab sizes in embedding space
  - 125k – unigram
  - 25k – bigram
  - 64k – character trigram
  - 500k – Out of vocabulary (OOV) bins
- Advantage of adding bigrams
  - Example – Chocolate milk vs milk chocolate
- Advantage of character trigrams
  - Robustness to spelling error
- OOV adds additional parameters
  - case 1 : 500k unigrams
  - case 2 : 125k unigrams and 375k OOV
  - case 2 performs better
Table 4
- Batch vs layer vs no normalisation
- query sorted data vs shuffled data
- Optimum combination was using batch normalisation with shuffled data
- Query sorted data did not have much effect on layer normalisation but was on batch normalisation
  - Batch estimates for mean and variant would be biased for batch normalisation
Table 5
- Comparing with baseline papers
  - DSSM
    - Learning Deep Structured Semantic Models for Web Search using Clickthrough Data
    - https://posenhuang.github.io/papers/cikm2013_DSSM_fullversion.pdf
  - ARC-II
    - Convolutional neural network architectures for matching natural language sentences
    - https://dl.acm.org/doi/10.5555/2969033.2969055
  - Match Pyramid
    - Text Matching as Image Recognition
    - https://arxiv.org/pdf/1602.06359.pdf
  - Two variation of their approach
    - Frozen embedding
      - Using pretrained word2vec and glove
    - Randomly initialised embedding
      - Lead 3X improvement in Recall@100 and MAP

Online evaluations

Three categories
- toys and games, kitchen and pets
Conversion rate, revenue and other KPI increased with statistical significance
Challenges
- Precision was decreased
- To mitigate they added guardrails using some heuristic and ranking models to filter non relevant items
- Not that recall@100 was metric for matching task and not precision.

Training Acceleration

Increasing query product pairs from 200 million to 1.2 billion had improved matching metric by 10 %
Data parallel training
- Idea
  - Split data
  - Same model on multiple GPUs
  - Take the average of gradient in the end
- Cons for this architecture
  - Limits embedding metric
    - It has to fit on single GPU
  - Data parallel approach has high communication overhead
    - We need to weight till forward pass from each GPU is available
Model parallel training
- Pros in this architecture
  - Simplicity of average pooling
  - Siamese has some parallelism by default
- How they implemented
  - Embedding metric is split across GPUs, across embedding dimension
    - Hence we need to concat to get final embedding vector for each token
      - Concatenation requires O(2k) communication of floating point numbers
    - Partial token embeddings are averaged at each GPUs
  - Full dataset is sent to all GPUs
In the result section they have told that embedding dimension was 256

Explaining above image
- For low embedding dimension time-to-train increases with number of GPUs but for high embedding dimension time-to-train decreases with number of GPUs
- Dotted line
  - Following two should have same training time
    - 2048 embedding with 1 GPU
    - 1024 embedding with 2 GPU
  - Doted line is for the ration of embedding_size/GPU
  - With ideal scaling it would be horizontal

Data Stories

[Paper] Semantic Product Search

Architecture

Loss Function

Tokenization Methods

Data

Model Evaluations

Evaluation Metrics

Evaluation Results

Online evaluations

Training Acceleration

Exampels

Leave a comment Cancel reply

Architecture

Loss Function

Tokenization Methods

Data

Model Evaluations

Evaluation Metrics

Evaluation Results

Online evaluations

Training Acceleration

Exampels

Share this:

Related

Leave a comment Cancel reply