This document may contain incorrect info! Please open a pull request to fix it when you find one.

One Hot Encoding

Featurized Representation (Word Embedding)

Word2Vec

Skip Gram Model

GloVe (Global Vectors for Word Representation)

A way to represent words so we can work with them easily.

Let's say that we have a dictionary that consists of 10 words, and the words of the dictionary are:

Car, Pen, Girl, Berry, Apple, Likes, The, And, Boy, Book.

Our $X^{(i)}$ is: **The Girl Likes Apple And Berry**

So we can represent this sequence like the following:

| Word (index) | The | Girl | Likes | Apple | And | Berry |
| --- | --- | --- | --- | --- | --- | --- |
| Car (0) | 0 | 0 | 0 | 0 | 0 | 0 |
| Pen (1) | 0 | 0 | 0 | 0 | 0 | 0 |
| Girl (2) | 0 | 1 | 0 | 0 | 0 | 0 |
| Berry (3) | 0 | 0 | 0 | 0 | 0 | 1 |
| Apple (4) | 0 | 0 | 0 | 1 | 0 | 0 |
| Likes (5) | 0 | 0 | 1 | 0 | 0 | 0 |
| The (6) | 1 | 0 | 0 | 0 | 0 | 0 |
| And (7) | 0 | 0 | 0 | 0 | 1 | 0 |
| Boy (8) | 0 | 0 | 0 | 0 | 0 | 0 |
| Book (9) | 0 | 0 | 0 | 0 | 0 | 0 |
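A quick sketch of this encoding in NumPy, using the toy 10-word dictionary from the example (variable names are made up for illustration):

```python
import numpy as np

# Toy dictionary from the example above (index = position in the list)
vocab = ["Car", "Pen", "Girl", "Berry", "Apple", "Likes", "The", "And", "Boy", "Book"]
word_to_index = {w: i for i, w in enumerate(vocab)}

def one_hot(word):
    """Return the one-hot vector for `word`: all zeros except a 1 at its index."""
    v = np.zeros(len(vocab))
    v[word_to_index[word]] = 1
    return v

sentence = "The Girl Likes Apple And Berry".split()
# Stack the six one-hot column vectors side by side, as in the table above
X = np.stack([one_hot(w) for w in sentence], axis=1)  # shape (10, 6)
```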

By representing sequences in this way, we can feed our data to neural networks.

If our dictionary consists of 10,000 words, then each vector will be 10,000-dimensional.

This representation cannot capture semantic features.

Representing words by associating them with features such as gender, age, royalty, food, cost, size, and so on.

Every feature is represented as a range between [-1, 1]

Thus, every word can be represented as a vector of these features

The dimension of each vector is related to the number of features that we pick

For a given word *w*, the embedding matrix *E* is a matrix that maps its 1-hot representation $o_w$ to its embedding $e_w$ as follows:

$$e_w=Eo_w$$
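A minimal sketch of what $e_w=Eo_w$ means in code, assuming a made-up 10-word vocabulary and 4 features:

```python
import numpy as np

# Hypothetical sizes: 10-word vocabulary, 4 features per word
vocab_size, n_features = 10, 4
rng = np.random.default_rng(0)
E = rng.standard_normal((n_features, vocab_size))  # embedding matrix

o_w = np.zeros(vocab_size)
o_w[2] = 1  # one-hot vector for the word at index 2 ("Girl" in the toy example)

e_w = E @ o_w  # e_w = E o_w
```

Multiplying by a one-hot vector just selects one column of *E*, so in practice libraries do a direct lookup (`E[:, 2]`) instead of a full matrix-vector product.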

Words that have a **similar** meaning have a **similar** representation. This model can capture semantic features.

These vectors are much smaller than the vectors in the one-hot representation.

TODO: Subtracting vectors of opposite words

Word2vec is a strategy to learn word embeddings by estimating the likelihood that a given word is surrounded by other words.

This is done by making context and target word pairs, which depend on the **window size** we take.

**Window size**: a parameter that looks to the left and right of the context word for as many as window_size words.

Creating **context to target** pairs with window size = 2:
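The pair-generation step can be sketched in plain Python (the function name is hypothetical):

```python
def context_target_pairs(tokens, window_size=2):
    """For each position, pair the word there with every neighbour
    within `window_size` positions on either side."""
    pairs = []
    for i, context in enumerate(tokens):
        lo = max(0, i - window_size)
        hi = min(len(tokens), i + window_size + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((context, tokens[j]))
    return pairs

sentence = "The Girl Likes Apple And Berry".split()
pairs = context_target_pairs(sentence, window_size=2)
# e.g. ("The", "Girl") and ("The", "Likes") are pairs,
# but ("The", "Apple") is not: "Apple" is 3 positions away
```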

The skip-gram word2vec model is a supervised learning task that learns word embeddings by assessing the likelihood of any given target word *t* happening with a context word *c*. By noting $ΞΈ_{t}$ a parameter associated with *t*, the probability *P(t|c)* is given by:

$$P(t|c)=\frac{\exp(\theta^T_te_c)}{\sum_{j=1}^{|V|}\exp(\theta^T_je_c)}$$

Remark: summing over the whole vocabulary in the denominator of the softmax part makes this model computationally expensive
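The softmax above can be sketched numerically with random made-up parameters, just to show the shape of the computation:

```python
import numpy as np

vocab_size, dim = 10, 4
rng = np.random.default_rng(1)
theta = rng.standard_normal((vocab_size, dim))  # one theta_j per target word
e_c = rng.standard_normal(dim)                  # embedding of the context word

logits = theta @ e_c                        # theta_j^T e_c for every word j
p = np.exp(logits) / np.exp(logits).sum()   # softmax over the whole vocabulary

# p[t] is P(t|c); the denominator touches all |V| words, which is the
# expensive part the remark above refers to.
```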

The GloVe model, short for global vectors for word representation, is a word embedding technique that uses a co-occurrence matrix *X* where each $X_{ij}$ denotes the number of times that a target *i* occurred with a context *j*. Its cost function *J* is as follows:

$$J(\theta)=\frac{1}{2}\sum_{i,j=1}^{|V|}f(X_{ij})(\theta^T_ie_j+b_i+b'_j-\log(X_{ij}))^2$$

where *f* is a weighting function such that $X_{ij}=0 \implies f(X_{ij})=0$. Given the symmetry that *e* and $\theta$ play in this model, the final word embedding $e^{(final)}_w$ is given by:

$$e^{(final)}_w=\frac{e_w+\theta_w}{2}$$
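A sketch of evaluating the GloVe cost and the final embeddings, using random toy parameters and the weighting function from the GloVe paper ($x_{max}=100$, $\alpha=3/4$):

```python
import numpy as np

vocab_size, dim = 10, 4
rng = np.random.default_rng(2)
X = rng.integers(0, 5, size=(vocab_size, vocab_size)).astype(float)  # toy co-occurrence counts
theta = rng.standard_normal((vocab_size, dim))
e = rng.standard_normal((vocab_size, dim))
b = rng.standard_normal(vocab_size)        # b_i
b_prime = rng.standard_normal(vocab_size)  # b'_j

def f(x, x_max=100.0, alpha=0.75):
    """Weighting function from the GloVe paper; note f(0) = 0."""
    return np.where(x < x_max, (x / x_max) ** alpha, 1.0) * (x > 0)

J = 0.0
for i in range(vocab_size):
    for j in range(vocab_size):
        if X[i, j] > 0:  # f(X_ij) = 0 when X_ij = 0, so those terms vanish
            diff = theta[i] @ e[j] + b[i] + b_prime[j] - np.log(X[i, j])
            J += 0.5 * f(X[i, j]) * diff ** 2

# By the symmetry of e and theta, average them for the final embeddings
e_final = (e + theta) / 2
```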

If this is your first try, you should download a pre-trained model; pre-trained embeddings usually work best in practice.

If you have enough data, you can try to implement one of the available algorithms.

Because word embeddings are very computationally expensive to train, most ML practitioners will load a pre-trained set of embeddings.
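Pre-trained GloVe releases ship as plain-text files: one word per line followed by its vector components. A minimal loader sketch (the tiny in-memory "file" below is made up, standing in for a real file such as one of the GloVe downloads):

```python
import io
import numpy as np

def load_embeddings(lines):
    """Parse the plain-text embedding format: `word v1 v2 ... vn` per line."""
    embeddings = {}
    for line in lines:
        parts = line.rstrip().split(" ")
        embeddings[parts[0]] = np.array(parts[1:], dtype=float)
    return embeddings

# Made-up two-word "file" for illustration; in practice, pass an
# open file handle to a downloaded embeddings file instead.
fake_file = io.StringIO("girl 0.1 0.2 0.3\nboy 0.2 0.1 0.3\n")
vectors = load_embeddings(fake_file)
```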