# Word Representation

Approaches to word representation

## Word Representation

This document may contain incorrect information. Please open a pull request to fix any errors you find 🙂

- One Hot Encoding
- Featurized Representation (Word Embedding)
- Word2Vec
- Skip Gram Model
- GloVe (Global Vectors for Word Representation)

## One Hot Encoding

A way to represent words so that we can deal with them easily

### Example

Let's say that we have a dictionary that consists of 10 words, and the words of the dictionary are:

Car, Pen, Girl, Berry, Apple, Likes, The, And, Boy, Book.

Our $$X^{(i)}$$ is: **The Girl Likes Apple And Berry**

So we can represent this sequence as a list of one-hot vectors: each word becomes a vector of zeros with a single 1 at that word's index in the dictionary.

By representing sequences in this way we can feed our data to neural networks.
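A minimal sketch of this encoding in Python, using the dictionary and sentence from the example (the helper names are our own):

```python
import numpy as np

# The 10-word dictionary from the example above.
dictionary = ["Car", "Pen", "Girl", "Berry", "Apple",
              "Likes", "The", "And", "Boy", "Book"]
word_to_index = {word: i for i, word in enumerate(dictionary)}

def one_hot(word):
    """Return a vector of zeros with a single 1 at the word's index."""
    vec = np.zeros(len(dictionary))
    vec[word_to_index[word]] = 1.0
    return vec

sentence = "The Girl Likes Apple And Berry".split()
X = np.stack([one_hot(w) for w in sentence])
print(X.shape)  # (6, 10): one 10-dimensional vector per word
```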

### π Disadvantage

If our dictionary consists of 10,000 words, then each vector will be 10,000-dimensional 😕

This representation cannot capture semantic features: every pair of distinct words is equally far apart, so Apple is no closer to Berry than it is to Car.

## Featurized Representation (Word Embedding)

Representing words by associating them with features such as gender, age, royalty, food, cost, size, and so on

Every feature is represented as a value in the range [-1, 1]

Thus, every word can be represented as a vector of these features

The dimension of each vector is related to the number of features that we pick
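As a toy illustration (the feature values below are invented guesses, not learned numbers), similar words get similar vectors, and feature arithmetic such as King − Man + Woman lands near Queen:

```python
import numpy as np

# Hypothetical 4-feature embeddings: (gender, royalty, age, food),
# each value in [-1, 1]. These numbers are illustrative only.
embedding = {
    "Man":   np.array([-1.00, 0.01, 0.03, 0.09]),
    "Woman": np.array([ 1.00, 0.02, 0.02, 0.01]),
    "King":  np.array([-0.95, 0.93, 0.70, 0.02]),
    "Queen": np.array([ 0.97, 0.95, 0.69, 0.01]),
    "Apple": np.array([ 0.00, -0.01, 0.03, 0.95]),
}

def cosine(u, v):
    """Cosine similarity between two feature vectors."""
    return u @ v / (np.linalg.norm(u) * np.linalg.norm(v))

# Queen is closer to King than Apple is, unlike in one-hot encoding
# where all words are equally dissimilar.
print(cosine(embedding["King"], embedding["Queen"]))
print(cosine(embedding["King"], embedding["Apple"]))

# King - Man + Woman points roughly at Queen.
analogy = embedding["King"] - embedding["Man"] + embedding["Woman"]
print(cosine(analogy, embedding["Queen"]))
```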

**Embedding Matrix**

For a given word *w*, the embedding matrix *E* is a matrix that maps its one-hot representation $$o_w$$ to its embedding $$e_w$$ as follows:

$$e_w=Eo_w$$
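A small sketch of this equation with random numbers: multiplying *E* by a one-hot vector just selects one column of *E*, which is why frameworks implement embeddings as a direct lookup rather than a full matrix multiplication.

```python
import numpy as np

vocab_size, embed_dim = 10, 4
rng = np.random.default_rng(0)
E = rng.standard_normal((embed_dim, vocab_size))  # embedding matrix (random stand-in)

w = 3                       # index of some word in the dictionary
o_w = np.zeros(vocab_size)
o_w[w] = 1.0                # one-hot representation of the word

e_w = E @ o_w               # e_w = E o_w

# Identical to simply selecting column w of E (the efficient lookup).
assert np.allclose(e_w, E[:, w])
```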

**Advantages**

- Words that have a **similar** meaning have a **similar** representation, so this model can capture semantic features.
- Vectors are smaller than vectors in the one-hot representation.

TODO: Subtracting vectors of opposite words

### Word2Vec

Word2vec is a strategy to learn word embeddings by estimating the likelihood that a given word is surrounded by other words.

This is done by making (context, target) word pairs, which depend on the **window size** we take.

**Window size**: a parameter that looks to the left and right of the context word for as many as window_size words

Creating context-to-target pairs with window size = 2:
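A sketch of how such pairs can be generated (function name is our own): for each position, every word up to two positions to the left or right becomes a target for that context word.

```python
def make_pairs(words, window_size=2):
    """Return (context, target) pairs within the given window."""
    pairs = []
    for i, context in enumerate(words):
        lo = max(0, i - window_size)
        hi = min(len(words), i + window_size + 1)
        for j in range(lo, hi):
            if j != i:                       # skip pairing a word with itself
                pairs.append((context, words[j]))
    return pairs

sentence = "The Girl Likes Apple And Berry".split()
pairs = make_pairs(sentence, window_size=2)
print(pairs[:3])  # [('The', 'Girl'), ('The', 'Likes'), ('Girl', 'The')]
```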

## Skip Gram Model

The skip-gram word2vec model is a supervised learning task that learns word embeddings by assessing the likelihood of any given target word *t* happening with a context word *c*. By noting $$\theta_{t}$$ a parameter associated with *t*, the probability *P(t|c)* is given by:

$$P(t|c)=\frac{exp(\theta^T_te_c)}{\sum_{j=1}^{|V|}exp(\theta^T_je_c)}$$

Remark: summing over the whole vocabulary in the denominator of the softmax part makes this model computationally expensive
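The formula above can be sketched with random stand-in parameters: `theta` holds one $$\theta_t$$ per target word, and the denominator is exactly the expensive sum over the whole vocabulary mentioned in the remark.

```python
import numpy as np

rng = np.random.default_rng(1)
vocab_size, embed_dim = 10, 4
theta = rng.standard_normal((vocab_size, embed_dim))  # one theta_t per target word
e_c = rng.standard_normal(embed_dim)                  # context embedding (random stand-in)

scores = theta @ e_c                            # theta_t^T e_c for every target t
probs = np.exp(scores) / np.exp(scores).sum()   # softmax: sum runs over all |V| words

# P(t|c) is a valid probability distribution over the vocabulary.
assert np.isclose(probs.sum(), 1.0)
```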

**One-Hot Representation vs Word Embedding**

## GloVe

The GloVe model, short for global vectors for word representation, is a word embedding technique that uses a co-occurrence matrix *X* where each $$X_{ij}$$ denotes the number of times that a target *i* occurred with a context *j*. Its cost function *J* is as follows:

$$J(\theta)=\frac{1}{2}\sum_{i,j=1}^{|V|}f(X_{ij})(\theta^T_ie_j+b_i+b'_j-\log(X_{ij}))^2$$

where *f* is a weighting function such that $$X_{ij}=0 \implies f(X_{ij})=0$$. Given the symmetry that *e* and *θ* play in this model, the final word embedding $$e^{(final)}_w$$ is given by:

$$e^{(final)}_w=\frac{e_w+\theta_w}{2}$$
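The pieces of the cost above can be sketched as follows. The weighting function in the original GloVe paper is $$f(x)=(x/x_{max})^{\alpha}$$ capped at 1, with $$\alpha = 3/4$$ and $$x_{max} = 100$$; the parameters below are random stand-ins, not trained values.

```python
import numpy as np

def f(x, x_max=100, alpha=0.75):
    """GloVe weighting function: zero co-occurrences contribute nothing."""
    return (x / x_max) ** alpha if x < x_max else 1.0

rng = np.random.default_rng(2)
d = 4
theta_i, e_j = rng.standard_normal(d), rng.standard_normal(d)
b_i, b_j = 0.1, -0.2   # bias terms b_i and b'_j (random stand-ins)
X_ij = 25              # times target i co-occurred with context j

# One (i, j) term of J; the full cost sums this over all pairs.
cost_ij = f(X_ij) * (theta_i @ e_j + b_i + b_j - np.log(X_ij)) ** 2

assert f(0) == 0.0  # X_ij = 0 implies f(X_ij) = 0, as required
```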

## Conclusion of Word Embeddings

If this is your first try, you should start from a pre-trained model; these are trained on very large corpora and usually work best.

If you have enough data, you can try to implement one of the available algorithms.

Because word embeddings are very computationally expensive to train, most ML practitioners will load a pre-trained set of embeddings.
