# Word Representation

This document may contain incorrect information. Please open a pull request if you find an error.

• One Hot Encoding

• Featurized Representation (Word Embedding)

• Word2Vec

• Skip Gram Model

• GloVe (Global Vectors for Word Representation)

# One Hot Encoding

A way to represent words as vectors so that we can process them easily

## Example

Let's say that we have a dictionary that consists of 10 words and the words of the dictionary are:

• Car, Pen, Girl, Berry, Apple, Likes, The, And, Boy, Book.

Our $X^{(i)}$ is: The Girl Likes Apple And Berry

So we can represent this sequence as follows, with one column vector per word of the sequence:

| Word  | Index | The | Girl | Likes | Apple | And | Berry |
|-------|-------|-----|------|-------|-------|-----|-------|
| Car   | 0     | 0   | 0    | 0     | 0     | 0   | 0     |
| Pen   | 1     | 0   | 0    | 0     | 0     | 0   | 0     |
| Girl  | 2     | 0   | 1    | 0     | 0     | 0   | 0     |
| Berry | 3     | 0   | 0    | 0     | 0     | 0   | 1     |
| Apple | 4     | 0   | 0    | 0     | 1     | 0   | 0     |
| Likes | 5     | 0   | 0    | 1     | 0     | 0   | 0     |
| The   | 6     | 1   | 0    | 0     | 0     | 0   | 0     |
| And   | 7     | 0   | 0    | 0     | 0     | 1   | 0     |
| Boy   | 8     | 0   | 0    | 0     | 0     | 0   | 0     |
| Book  | 9     | 0   | 0    | 0     | 0     | 0   | 0     |

By representing sequences in this way we can feed our data to neural networks

• If our dictionary consists of 10,000 words, each vector will be 10,000-dimensional

• This representation cannot capture semantic features: any two distinct one-hot vectors are orthogonal, so no pair of words is closer than any other
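The encoding above can be sketched in a few lines of plain Python. This is a minimal illustration that assumes the 10-word dictionary from the example; the helper names `one_hot` and `dot` are made up for this sketch.

```python
# Vocabulary from the example above; a word's index is its position in this list.
VOCAB = ["Car", "Pen", "Girl", "Berry", "Apple", "Likes", "The", "And", "Boy", "Book"]
WORD_TO_INDEX = {word: i for i, word in enumerate(VOCAB)}


def one_hot(word):
    """Return a list of len(VOCAB) zeros with a single 1 at the word's index."""
    vector = [0] * len(VOCAB)
    vector[WORD_TO_INDEX[word]] = 1
    return vector


def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


sentence = "The Girl Likes Apple And Berry".split()
encoded = [one_hot(word) for word in sentence]

for word, vector in zip(sentence, encoded):
    print(f"{word:<6} {vector}")

# Distinct words always have inner product 0, so "Apple" is no closer to
# "Berry" than it is to "Car": the encoding carries no notion of similarity.
print(dot(one_hot("Apple"), one_hot("Berry")))
print(dot(one_hot("Apple"), one_hot("Car")))
```

Each vector is as long as the dictionary, which is why the representation becomes very sparse for realistic vocabulary sizes.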

## Featurized Representation (Word Embedding)

• Representing words by associating them with features such as gender, age, royal, food, cost, size, and so on

• Every feature is represented as a value in the range [-1, 1]

• Thus, every word can be represented as a vector of these features

• The dimension of each vector equals the number of features that we pick

### Embedding Matrix

For a given word w, the embedding matrix E is a matrix that maps its one-hot representation $o_w$ to its embedding $e_w$ as follows:

$e_w=Eo_w$

• Words that have similar meanings have similar representations.

• This model can capture semantic features

• Embedding vectors are much smaller than the vectors of the one-hot representation.
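As a small illustration of $e_w=Eo_w$, the sketch below multiplies a toy embedding matrix by a one-hot vector and confirms that the product simply selects one column of E. The four-word vocabulary, the feature names, and every number in `E` are hypothetical values chosen by hand, not learned embeddings.

```python
import math

VOCAB = ["Boy", "Girl", "Apple", "Berry"]
FEATURES = ["gender", "age", "food", "size"]

# Hypothetical embedding matrix E: one row per feature, one column per word,
# every entry in [-1, 1]. Real embeddings are learned, not hand-written.
E = [
    [-1.0, 1.0, 0.0, 0.0],    # gender
    [-0.6, -0.6, 0.0, 0.0],   # age
    [0.0, 0.0, 0.9, 0.8],     # food
    [0.5, 0.5, -0.3, -0.8],   # size
]


def one_hot(word):
    """One-hot vector o_w over VOCAB."""
    return [1 if w == word else 0 for w in VOCAB]


def embed(word):
    """Compute e_w = E o_w as an explicit matrix-vector product."""
    o_w = one_hot(word)
    return [sum(entry * o for entry, o in zip(row, o_w)) for row in E]


def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Multiplying by a one-hot vector just picks out the matching column of E.
girl_column = [row[VOCAB.index("Girl")] for row in E]
print(embed("Girl") == girl_column)

# With these toy values, "Apple" is closer to "Berry" than to "Boy".
print(cosine(embed("Apple"), embed("Berry")))
print(cosine(embed("Apple"), embed("Boy")))
```

Because the product only selects a column, practical implementations usually index into E directly instead of performing the full multiplication.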

TODO: Subtracting vectors of opposite words

## Word2Vec

• Word2vec is a strategy to learn word embeddings by estimating the likelihood that a given word is surrounded by other words.

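• This is done by building (context, target) word pairs; which pairs are produced depends on the window size we choose.

• Window size: the number of words to the left and to the right of the context word from which target words are taken

With window size = 2, each context word is paired with up to two words on either side. In "The Girl Likes Apple And Berry", the context word "Likes" yields the targets "The", "Girl", "Apple", and "And".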
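The pairing step can be sketched as follows. This is a minimal version that enumerates every (context, target) pair inside the window; the function name `context_target_pairs` is an assumption of this sketch, and real implementations often sample targets at random rather than listing them all.

```python
def context_target_pairs(tokens, window_size=2):
    """List (context, target) pairs for every token and each neighbour
    at most `window_size` positions to its left or right."""
    pairs = []
    for i, context in enumerate(tokens):
        start = max(0, i - window_size)
        end = min(len(tokens), i + window_size + 1)
        for j in range(start, end):
            if j != i:  # a word is not paired with itself
                pairs.append((context, tokens[j]))
    return pairs


tokens = "The Girl Likes Apple And Berry".split()
pairs = context_target_pairs(tokens, window_size=2)

for context, target in pairs:
    print(f"{context} -> {target}")
```

Words near the start or end of the sequence get fewer pairs because the window is clipped at the sequence boundaries.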

## Skip Gram Model

The skip-gram word2vec model is a supervised learning task that learns word embeddings by assessing the likelihood of any given target word t occurring with a context word c. Writing $\theta_t$ for a parameter vector associated with t, the probability P(t|c) is given by:

$P(t|c)=\frac{\exp(\theta^T_te_c)}{\sum_{j=1}^{|V|}\exp(\theta^T_je_c)}$

Remark: summing over the whole vocabulary in the denominator of the softmax part makes this model computationally expensive
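To make the cost of the denominator concrete, here is a minimal sketch of the softmax above for a toy vocabulary. The values in `thetas` and `e_c` are hypothetical numbers picked for illustration, and the subtraction of the maximum score is a standard numerical-stability step that does not change the result.

```python
import math


def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def skip_gram_probabilities(thetas, e_c):
    """Return P(t|c) for every target t, given one theta vector per
    vocabulary word and the embedding e_c of the context word."""
    scores = [dot(theta_j, e_c) for theta_j in thetas]
    max_score = max(scores)  # shift scores so exp() cannot overflow
    exps = [math.exp(s - max_score) for s in scores]
    total = sum(exps)  # the sum over all |V| words: the expensive part
    return [x / total for x in exps]


# Hypothetical toy parameters: |V| = 4 words, embedding dimension 3.
thetas = [
    [0.2, -0.1, 0.4],
    [0.9, 0.3, -0.5],
    [-0.7, 0.8, 0.1],
    [0.0, 0.0, 0.0],
]
e_c = [0.5, 0.1, -0.3]

probs = skip_gram_probabilities(thetas, e_c)
print(probs)
print(sum(probs))
```

Every call touches all |V| parameter vectors, so with a vocabulary of 10,000 words each probability requires 10,000 inner products.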

## GloVe

The GloVe model, short for global vectors for word representation, is a word embedding technique that uses a co-occurrence matrix X where each $X_{ij}$ denotes the number of times that a target i occurred with a context j. Its cost function J is as follows:

$J(\theta)=\frac{1}{2}\sum_{i,j=1}^{|V|}f(X_{ij})(\theta^T_ie_j+b_i+b'_j-\log(X_{ij}))^2$

where f is a weighting function such that $X_{ij}=0 \implies f(X_{ij})=0$. Given the symmetric roles that e and $\theta$ play in this model, the final word embedding $e^{(final)}_w$ is given by:

$e^{(final)}_w=\frac{e_w+\theta_w}{2}$
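The cost and the final averaging step can be sketched as below. The text above only requires that f(0) = 0; the specific form used in `weight` (a power of the count, capped at 1, with `x_max=100` and `alpha=0.75`) is an assumption of this sketch, and all toy counts and parameters are made up.

```python
import math


def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def weight(x, x_max=100.0, alpha=0.75):
    """Assumed weighting function: f(0) = 0, then (x / x_max) ** alpha, capped at 1."""
    if x <= 0:
        return 0.0
    return min(1.0, (x / x_max) ** alpha)


def glove_cost(X, theta, e, b, b_prime):
    """J = 1/2 * sum_ij f(X_ij) * (theta_i . e_j + b_i + b'_j - log X_ij)^2."""
    total = 0.0
    for i in range(len(X)):
        for j in range(len(X[i])):
            f = weight(X[i][j])
            if f == 0.0:
                continue  # zero counts contribute nothing, so log(0) is never evaluated
            error = dot(theta[i], e[j]) + b[i] + b_prime[j] - math.log(X[i][j])
            total += f * error ** 2
    return 0.5 * total


def final_embedding(e_w, theta_w):
    """Average the two symmetric vectors learned for the same word."""
    return [(a + t) / 2 for a, t in zip(e_w, theta_w)]


# Hypothetical toy problem: 2 words, 2-dimensional vectors.
X = [[0, 3], [3, 10]]
theta = [[0.1, 0.2], [0.3, -0.1]]
e = [[0.0, 0.4], [0.5, 0.2]]
b = [0.0, 0.1]
b_prime = [0.05, 0.0]

print(glove_cost(X, theta, e, b, b_prime))
print(final_embedding(e[0], theta[0]))
```

Skipping entries with a zero count is what the condition on f buys: only observed co-occurrences enter the sum.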

## Conclusion of Word Embeddings

• If this is your first attempt, start by downloading a pre-trained model; this is usually the option that works best.

• If you have enough data, you can try to implement one of the available algorithms.

• Because word embeddings are very computationally expensive to train, most ML practitioners will load a pre-trained set of embeddings.
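Loading a pre-trained set of embeddings can be sketched as follows. This assumes a plain-text file with one word per line followed by its space-separated vector components, which is the layout commonly used for distributed GloVe vectors. The function name `load_text_embeddings`, the file name, and the numbers in the demo file are all hypothetical.

```python
import os
import tempfile


def load_text_embeddings(path):
    """Parse lines of the form 'word v1 v2 ... vn' into a dict of word -> list of floats."""
    embeddings = {}
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            parts = line.rstrip().split(" ")
            if len(parts) < 2:
                continue  # skip blank or malformed lines
            embeddings[parts[0]] = [float(value) for value in parts[1:]]
    return embeddings


# Demo on a tiny made-up file so the sketch runs without any download.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "toy_vectors.txt")
    with open(demo_path, "w", encoding="utf-8") as handle:
        handle.write("apple 0.1 0.9 -0.3\nberry 0.0 0.8 -0.8\n")
    vectors = load_text_embeddings(demo_path)

print(vectors["apple"])
```

For a real pre-trained file, only the path passed to the loader would change, provided the file follows the same layout.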