Word Representation
Approaches to word representation
This document may contain incorrect info. Please open a pull request to fix it when you find one.
One Hot Encoding
Featurized Representation (Word Embedding)
Word2Vec
Skip Gram Model
GloVe (Global Vectors for Word Representation)
One Hot Encoding
A way to represent words so that we can deal with them easily.
Example
Let's say that we have a dictionary that consists of 10 words, and the words of the dictionary are:
Car, Pen, Girl, Berry, Apple, Likes, The, And, Boy, Book.
Our $$X^{(i)}$$ is: The Girl Likes Apple And Berry
So we can represent this sequence as follows:
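As a minimal sketch (assuming the 10-word dictionary above and using NumPy; the order of the vocabulary is an arbitrary choice for this sketch), each word of the sentence becomes a 10-dimensional vector with a single 1:

```python
import numpy as np

# Assumed dictionary from the example above (order is an arbitrary choice)
vocab = ["Car", "Pen", "Girl", "Berry", "Apple", "Likes", "The", "And", "Boy", "Book"]
word_to_index = {word: i for i, word in enumerate(vocab)}

sentence = "The Girl Likes Apple And Berry".split()

def one_hot(word, vocab_size=len(vocab)):
    """Return a one-hot vector for a word in the dictionary."""
    vec = np.zeros(vocab_size)
    vec[word_to_index[word]] = 1
    return vec

# Each word in the sequence becomes a 10-dimensional one-hot vector
X_i = np.stack([one_hot(w) for w in sentence])  # shape: (6, 10)
print(X_i)
```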
By representing sequences in this way, we can feed our data to neural networks.
Disadvantages
If our dictionary consists of 10,000 words, then each vector will be 10,000-dimensional.
This representation cannot capture semantic features: any two different one-hot vectors are orthogonal, so "Apple" is no closer to "Berry" than it is to "Car".
Featurized Representation (Word Embedding)
Representing words by associating them with features such as gender, age, royal, food, cost, size, and so on.
Every feature takes a value in the range [-1, 1].
Thus, every word can be represented as a vector of these features
The dimension of each vector equals the number of features that we pick.
Embedding Matrix
For a given word w, the embedding matrix E is a matrix that maps its 1-hot representation $$o_w$$ to its embedding $$e_w$$ as follows:
$$e_w=Eo_w$$
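A tiny sketch of this lookup with hypothetical sizes (a 10-word vocabulary and 3 features): multiplying E by the one-hot vector $$o_w$$ simply selects the column of E for word w, which is why in practice the embedding is read directly from E rather than computed with a matrix product.

```python
import numpy as np

vocab_size, n_features = 10, 3                # hypothetical sizes for this sketch
E = np.random.randn(n_features, vocab_size)   # embedding matrix (features x vocabulary)

w = 4                                         # index of some word in the vocabulary
o_w = np.zeros(vocab_size)
o_w[w] = 1                                    # one-hot representation o_w of the word

e_w = E @ o_w                                 # e_w = E o_w
assert np.allclose(e_w, E[:, w])              # same as selecting the w-th column of E
```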
Advantages
Words that have a similar meaning have a similar representation (see the sketch below).
This model can capture semantic features.
Vectors are smaller than vectors in the one-hot representation.
TODO: Subtracting vectors of opposite words
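As a rough illustration of the first advantage, here is a sketch that uses tiny made-up feature vectors (hypothetical values, not a trained embedding) and compares cosine similarities:

```python
import numpy as np

# Hypothetical feature vectors (food, size, cost) -- illustrative values only
embeddings = {
    "Apple": np.array([0.9, -0.7, -0.3]),
    "Berry": np.array([0.8, -0.9, -0.2]),
    "Car":   np.array([-0.9, 0.8, 0.9]),
}

def cosine_similarity(u, v):
    """Cosine of the angle between two embedding vectors."""
    return u @ v / (np.linalg.norm(u) * np.linalg.norm(v))

print(cosine_similarity(embeddings["Apple"], embeddings["Berry"]))  # close to 1
print(cosine_similarity(embeddings["Apple"], embeddings["Car"]))    # close to -1
```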
Word2Vec
Word2vec is a strategy to learn word embeddings by estimating the likelihood that a given word is surrounded by other words.
This is done by creating context and target word pairs, which depend on the window size we choose.
Window size: a parameter that looks at up to window_size words to the left and right of the context word.
Creating context-to-target pairs with window size = 2:
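A minimal sketch of how these pairs could be generated for the example sentence (the simple loop below, where every word within the window around a position is paired with it, is an assumption of the sketch):

```python
def context_target_pairs(sentence, window_size=2):
    """For each target word, pair it with every word within window_size positions."""
    pairs = []
    for i, target in enumerate(sentence):
        lo = max(0, i - window_size)
        hi = min(len(sentence), i + window_size + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((sentence[j], target))  # (context, target)
    return pairs

sentence = "The Girl Likes Apple And Berry".split()
for context, target in context_target_pairs(sentence, window_size=2):
    print(context, "->", target)
```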
Skip Gram Model
The skip-gram word2vec model is a supervised learning task that learns word embeddings by assessing the likelihood of any given target word t happening with a context word c. By noting $$\theta_t$$ a parameter associated with t, the probability P(t|c) is given by:
$$P(t|c)=\frac{exp(\theta^T_te_c)}{\sum_{j=1}^{|V|}exp(\theta^T_je_c)}$$
Remark: summing over the whole vocabulary in the denominator of the softmax makes this model computationally expensive.
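A sketch of this softmax with hypothetical small matrices standing in for the parameters $$\theta$$ and the embeddings; note that the denominator sums over the entire vocabulary, which is exactly the expensive part:

```python
import numpy as np

vocab_size, emb_dim = 10, 3                   # hypothetical sizes for this sketch
theta = np.random.randn(vocab_size, emb_dim)  # one parameter vector theta_j per word
E = np.random.randn(emb_dim, vocab_size)      # word embeddings, one column per word

def p_target_given_context(t, c):
    """P(t | c) = softmax over the whole vocabulary of theta_j^T e_c."""
    e_c = E[:, c]
    scores = theta @ e_c                      # theta_j^T e_c for every j in the vocabulary
    scores -= scores.max()                    # for numerical stability
    exp_scores = np.exp(scores)
    return exp_scores[t] / exp_scores.sum()   # the sum over |V| is the costly part

print(p_target_given_context(t=3, c=6))
```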
One Hot Representation vs. Word Embedding
GloVe
The GloVe model, short for global vectors for word representation, is a word embedding technique that uses a co-occurrence matrix X where each $$X_{ij}$$ denotes the number of times that a target i occurred with a context j. Its cost function J is as follows:
$$J(\theta)=\frac{1}{2}\sum_{i,j=1}^{|V|}f(X_{ij})(\theta^T_ie_j+b_i+b'_j-\log(X_{ij}))^2$$
where f is a weighting function such that $$X_{ij}=0 \implies f(X_{ij})=0$$. Given the symmetric roles that e and $$\theta$$ play in this model, the final word embedding $$e^{(final)}_w$$ is given by:
$$e^{(final)}_w=\frac{e_w+\theta_w}{2}$$
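A sketch of the cost J on a toy co-occurrence matrix; the particular weighting function used here, $$f(x)=(x/x_{max})^{\alpha}$$ capped at 1, and the random initial parameters are assumptions of the sketch:

```python
import numpy as np

vocab_size, emb_dim = 10, 3                    # hypothetical sizes for this sketch
X = np.random.randint(0, 5, (vocab_size, vocab_size)).astype(float)  # toy co-occurrence counts

theta = np.random.randn(vocab_size, emb_dim)   # target-word vectors theta_i
e = np.random.randn(vocab_size, emb_dim)       # context-word vectors e_j
b = np.random.randn(vocab_size)                # biases b_i
b_prime = np.random.randn(vocab_size)          # biases b'_j

def f(x, x_max=100, alpha=0.75):
    """Weighting function: 0 when x = 0, grows with x, capped at 1."""
    return np.where(x < x_max, (x / x_max) ** alpha, 1.0)

def glove_cost():
    J = 0.0
    for i in range(vocab_size):
        for j in range(vocab_size):
            if X[i, j] == 0:
                continue                       # f(0) = 0, so these terms vanish
            diff = theta[i] @ e[j] + b[i] + b_prime[j] - np.log(X[i, j])
            J += f(X[i, j]) * diff ** 2
    return 0.5 * J

print(glove_cost())
```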
Conclusion of Word Embeddings
If this is your first try, you should download a pre-trained model, since pre-trained embeddings usually work best.
If you have enough data, you can try to implement one of the available algorithms.
Because word embeddings are very computationally expensive to train, most ML practitioners will load a pre-trained set of embeddings.