This document may contain incorrect info 🙄‼ Please open a pull request to fix it if you find any 🌟
One Hot Encoding
Featurized Representation (Word Embedding)
Skip Gram Model
GloVe (Global Vectors for Word Representation)
A way to represent words so that we can work with them easily
Let's say that we have a dictionary that consists of 10 words (🤭) and the words of the dictionary are:
Car, Pen, Girl, Berry, Apple, Likes, The, And, Boy, Book.
Our sequence is: The Girl Likes Apple And Berry
So we can represent this sequence like the following 👀
Car   -0) ⌈ 0 ⌉ ⌈ 0 ⌉ ⌈ 0 ⌉ ⌈ 0 ⌉ ⌈ 0 ⌉ ⌈ 0 ⌉
Pen   -1) | 0 | | 0 | | 0 | | 0 | | 0 | | 0 |
Girl  -2) | 0 | | 1 | | 0 | | 0 | | 0 | | 0 |
Berry -3) | 0 | | 0 | | 0 | | 0 | | 0 | | 1 |
Apple -4) | 0 | | 0 | | 0 | | 1 | | 0 | | 0 |
Likes -5) | 0 | | 0 | | 1 | | 0 | | 0 | | 0 |
The   -6) | 1 | | 0 | | 0 | | 0 | | 0 | | 0 |
And   -7) | 0 | | 0 | | 0 | | 0 | | 1 | | 0 |
Boy   -8) | 0 | | 0 | | 0 | | 0 | | 0 | | 0 |
Book  -9) ⌊ 0 ⌋ ⌊ 0 ⌋ ⌊ 0 ⌋ ⌊ 0 ⌋ ⌊ 0 ⌋ ⌊ 0 ⌋
By representing sequences in this way we can feed our data to neural networks✨
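If our dictionary consists of 10,000 words, each vector will be 10,000-dimensional 🤕
This representation cannot capture semantic features: any two different one-hot vectors are orthogonal, so no pair of words looks more similar than any other 💔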
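The encoding above can be sketched in a few lines of plain Python. This is only a minimal illustration: the names `vocabulary`, `word_to_index`, `one_hot` and `dot` are made up for the example, and the index of each word is simply its position in the dictionary listed earlier.

```python
# Vocabulary from the example above; each word's index is its position in the list.
vocabulary = ["Car", "Pen", "Girl", "Berry", "Apple", "Likes", "The", "And", "Boy", "Book"]
word_to_index = {word: index for index, word in enumerate(vocabulary)}


def one_hot(word):
    """Return a list of zeros with a single 1 at the word's vocabulary index."""
    vector = [0] * len(vocabulary)
    vector[word_to_index[word]] = 1
    return vector


def dot(u, v):
    """Dot product of two equally sized lists."""
    return sum(a * b for a, b in zip(u, v))


sentence = "The Girl Likes Apple And Berry".split()
encoded = [one_hot(word) for word in sentence]

for word, vector in zip(sentence, encoded):
    print(f"{word:<6}", vector)

# Two different one-hot vectors always have a dot product of 0,
# so "Apple" is no closer to "Berry" than it is to "Car".
print(dot(one_hot("Apple"), one_hot("Berry")))
print(dot(one_hot("Apple"), one_hot("Car")))
```

Each vector is as long as the vocabulary, which is why this representation grows quickly with the dictionary size.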
Representing words by associating them with features such as gender, age, royal, food, cost, size, and so on
Every feature is represented as a value in the range [-1, 1]
Thus, every word can be represented as a vector of these features
The dimension of each vector equals the number of features that we pick
For a given word $w$, the embedding matrix $E$ is a matrix that maps its one-hot representation $o_w$ to its embedding $e_w$ as follows:

$$e_w = E\,o_w$$
Words that have similar meanings have similar representations.
This model can capture semantic features ✨
Vectors are much smaller than those in the one-hot representation.
TODO: Subtracting vectors of opposite words
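Here is a rough sketch of the lookup `e_w = E · o_w` and of how similar words end up with similar vectors. Everything in it is an assumption made for illustration: the tiny vocabulary, the three features and all the numbers in `E` are invented, and cosine similarity is just one common way to compare two embeddings.

```python
import math

vocabulary = ["Boy", "Girl", "Apple", "Berry"]
features = ["gender", "age", "food"]

# E has one row per feature and one column per word (all values are made up).
E = [
    [-1.0, 1.0, 0.0, 0.01],   # gender
    [0.3, 0.3, 0.02, 0.01],   # age
    [0.01, 0.02, 0.95, 0.97], # food
]


def one_hot(word):
    """One-hot vector o_w for a word of the small vocabulary."""
    vector = [0] * len(vocabulary)
    vector[vocabulary.index(word)] = 1
    return vector


def embed(word):
    """Compute e_w = E * o_w with plain loops (a matrix-vector product)."""
    o_w = one_hot(word)
    return [sum(row[j] * o_w[j] for j in range(len(o_w))) for row in E]


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


print(embed("Apple"))
print(cosine_similarity(embed("Apple"), embed("Berry")))  # close to 1
print(cosine_similarity(embed("Apple"), embed("Boy")))    # close to 0
```

Multiplying by a one-hot vector just selects one column of `E`, so real implementations usually index the column directly instead of doing the full product.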
Word2vec is a strategy to learn word embeddings by estimating the likelihood that a given word is surrounded by other words.
This is done by building (context, target) word pairs, which depend on the window size we choose.
Window size: a parameter that sets how many words to the left and to the right of the context word are paired with it (up to window_size words on each side)
Creating Context to Target pairs with window size = 2 🙌
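A minimal sketch of how such pairs could be generated, reusing the example sentence from the one-hot section. The function name `context_target_pairs` is hypothetical, and the sketch follows the convention used in these notes: the word at the center of the window is the context, and its neighbors are the targets.

```python
def context_target_pairs(tokens, window_size=2):
    """Pair every word (the context) with each neighbor inside its window (the targets)."""
    pairs = []
    for i, context in enumerate(tokens):
        # Clip the window so it never runs past the start or end of the sentence.
        start = max(0, i - window_size)
        end = min(len(tokens), i + window_size + 1)
        for j in range(start, end):
            if j != i:  # a word is not paired with itself
                pairs.append((context, tokens[j]))
    return pairs


tokens = "The Girl Likes Apple And Berry".split()
pairs = context_target_pairs(tokens, window_size=2)

for context, target in pairs:
    print(context, "->", target)
```

Words near the edges of the sentence get fewer pairs because part of their window falls outside the sentence.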
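The skip-gram word2vec model is a supervised learning task that learns word embeddings by assessing the likelihood of any given target word $t$ occurring with a context word $c$. Writing $\theta_t$ for a parameter vector associated with $t$ and $e_c$ for the embedding of $c$, the probability $P(t|c)$ is given by:

$$P(t|c) = \frac{\exp(\theta_t^T e_c)}{\sum_{j=1}^{|V|} \exp(\theta_j^T e_c)}$$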
Remark: summing over the whole vocabulary in the denominator of the softmax part makes this model computationally expensive
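To make the remark concrete, here is a small sketch of the softmax that turns the scores between a context embedding and every word's parameter vector into `P(t|c)`. The function name `skip_gram_probabilities`, the three-word vocabulary and all numbers are assumptions for illustration; subtracting the maximum score is a standard numerical-stability trick that does not change the result.

```python
import math


def skip_gram_probabilities(theta, e_c):
    """Return P(t | c) for every word t, given the context embedding e_c.

    theta: list of parameter vectors, one per vocabulary word
    e_c:   embedding vector of the context word
    """
    scores = [sum(a * b for a, b in zip(theta_t, e_c)) for theta_t in theta]
    max_score = max(scores)  # shift scores for numerical stability
    exps = [math.exp(score - max_score) for score in scores]
    total = sum(exps)  # this sum runs over the WHOLE vocabulary
    return [value / total for value in exps]


# Made-up parameters for a 3-word vocabulary with 2-dimensional embeddings.
theta = [[0.2, 0.1], [1.0, 0.5], [-0.3, 0.8]]
e_c = [0.9, 0.4]

probabilities = skip_gram_probabilities(theta, e_c)
print(probabilities)
print(sum(probabilities))
```

The denominator needs one dot product per vocabulary word for every single training pair, which is where the cost comes from when the vocabulary has tens of thousands of entries.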
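The GloVe model, short for global vectors for word representation, is a word embedding technique that uses a co-occurrence matrix $X$ where each $X_{ij}$ denotes the number of times that a target $i$ occurred with a context $j$. Its cost function $J$ is as follows:

$$J(\theta) = \frac{1}{2} \sum_{i,j=1}^{|V|} f(X_{ij}) \left(\theta_i^T e_j + b_i + b_j' - \log(X_{ij})\right)^2$$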
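where $f$ is a weighting function such that $X_{ij} = 0 \Longrightarrow f(X_{ij}) = 0$. Given the symmetry that $e$ and $\theta$ play in this model, the final word embedding $e_w^{(\text{final})}$ is given by:

$$e_w^{(\text{final})} = \frac{e_w + \theta_w}{2}$$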
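The sketch below evaluates a GloVe-style cost on a toy co-occurrence matrix: a weighted squared error between `θ_i · e_j + b_i + b'_j` and `log X_ij`, followed by the averaging of `e` and `θ`. Several things here are assumptions rather than something stated in these notes: the bias terms `b` and `b_prime`, the particular weighting function (the capped power law with `x_max = 100` and `alpha = 0.75` suggested in the GloVe paper), and all function names and numbers.

```python
import math


def weight(x, x_max=100.0, alpha=0.75):
    """Weighting function f: zero for x = 0, grows with x, capped at 1 (assumed form)."""
    if x <= 0:
        return 0.0
    return min(1.0, (x / x_max) ** alpha)


def glove_cost(X, theta, e, b, b_prime):
    """Half the weighted squared error between the model score and log X_ij."""
    total = 0.0
    for i in range(len(X)):
        for j in range(len(X[i])):
            if X[i][j] == 0:
                continue  # f(0) = 0, so the pair contributes nothing (and log(0) is avoided)
            score = sum(a * c for a, c in zip(theta[i], e[j])) + b[i] + b_prime[j]
            total += weight(X[i][j]) * (score - math.log(X[i][j])) ** 2
    return 0.5 * total


def final_embeddings(e, theta):
    """Average the two symmetric sets of vectors: (e_w + theta_w) / 2."""
    return [[(a + c) / 2 for a, c in zip(e_w, theta_w)] for e_w, theta_w in zip(e, theta)]


# Toy example: 2 words, 1-dimensional vectors, made-up counts and parameters.
X = [[0, 2], [2, 0]]
theta = [[0.5], [-0.5]]
e = [[1.0], [2.0]]
b = [0.0, 0.0]
b_prime = [0.0, 0.0]

print(glove_cost(X, theta, e, b, b_prime))
print(final_embeddings(e, theta))
```

A real training loop would minimize this cost with gradient descent over `theta`, `e` and the biases; the sketch only evaluates it once.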
If this is your first attempt, try downloading a pre-trained model first; it is already built and usually works best.
If you have enough data, you can try to implement one of the available algorithms.
Because word embeddings are very computationally expensive to train, most ML practitioners will load a pre-trained set of embeddings.
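As a rough idea of what loading pre-trained embeddings can look like, the sketch below parses a plain-text file with one word per line followed by its vector components separated by spaces, which is the layout the published GloVe text files use. The function name `load_embeddings` and the two sample lines are invented for the example; an in-memory `io.StringIO` stands in for a real downloaded file.

```python
import io


def load_embeddings(lines):
    """Parse 'word v1 v2 ... vn' lines into a dict mapping word -> list of floats."""
    embeddings = {}
    for line in lines:
        parts = line.rstrip().split(" ")
        if len(parts) < 2:
            continue  # skip blank or malformed lines
        embeddings[parts[0]] = [float(value) for value in parts[1:]]
    return embeddings


# Stand-in for a real file; the words and numbers are made up.
sample_file = io.StringIO("apple 0.1 0.9 0.3\nberry 0.2 0.8 0.4\n")
embeddings = load_embeddings(sample_file)

print(embeddings["apple"])
print(len(embeddings))
```

With a real download, the same function could be fed an open file object, for example `with open(path, encoding="utf-8") as f: load_embeddings(f)`, where `path` is wherever you saved the vectors.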