One-hot encoded vectors
Words are presented to a computer in the form of word vectors, and one of the simplest forms of word vectors is the one-hot encoded vector. To understand this concept, assume we have a very small vocabulary containing four words: magic, dragon, king, and queen. The one-hot encoded vector for the word dragon has a 1 at the index reserved for dragon and a 0 everywhere else: [0, 1, 0, 0].
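As a quick illustration, here is a minimal Python sketch of one-hot encoding over this toy vocabulary (the list and the `one_hot` helper are just for illustration, not part of any library):

```python
# Toy vocabulary from the example above
vocabulary = ["magic", "dragon", "king", "queen"]

def one_hot(word, vocab):
    """Return a vector with a 1 at the word's index and 0 everywhere else."""
    vector = [0] * len(vocab)
    vector[vocab.index(word)] = 1
    return vector

print(one_hot("dragon", vocabulary))  # [0, 1, 0, 0]
```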
Encodings like these capture nothing apart from the presence or absence of words in a sentence, but the representations described later in this post are built on this idea.
Bag of Words (BoW)
Each word or n-gram in a document is linked to a vector index, and that index holds the count of the number of times the word appears in the document. This technique is popularly used in document classification. For example, take these two strings as our documents: “The quick brown fox jumped over the lazy dog.” and “The dog woke up and started chasing the fox.” From them we can form a dictionary of 13 unique words: the, quick, brown, fox, jumped, over, lazy, dog, woke, up, and, started, chasing.
Vectors are then formed with the count of each dictionary word in each document. In this case, each document is represented by a 13-element count vector.
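Here is a short sketch that builds this dictionary and the count vectors for the two example documents. The simple regex tokenizer (lowercasing and stripping punctuation) is an implementation choice for this sketch, not part of the method itself:

```python
import re
from collections import Counter

docs = [
    "The quick brown fox jumped over the lazy dog.",
    "The dog woke up and started chasing the fox.",
]

def tokenize(text):
    # Lowercase and keep alphabetic tokens only
    return re.findall(r"[a-z]+", text.lower())

# Build the dictionary: one index per unique word, in order of first appearance
vocab = []
for doc in docs:
    for token in tokenize(doc):
        if token not in vocab:
            vocab.append(token)

print(len(vocab))  # 13

# Count vector for each document
for doc in docs:
    counts = Counter(tokenize(doc))
    print([counts[word] for word in vocab])
# [2, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0]
# [3, 0, 0, 1, 0, 0, 0, 1, 1, 1, 1, 1, 1]
```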
The limitation of this method is that it produces extremely high-dimensional, sparse feature vectors. Still, the model is useful when you want to build a baseline in just a few lines of code or when your dataset is small.
TF-IDF
Since BoW cares only about the frequency of words, TF-IDF takes a step forward. TF-IDF, short for Term Frequency – Inverse Document Frequency, is a popular algorithm in Natural Language Processing. It measures how important a word or n-gram is to a document within a corpus.
Every word in the document is given a score that increases with the word’s frequency in that document but is offset by how often the word appears across the corpus. TF-IDF values can be used in place of the raw counts in the vectors described above. Since this is still a purely statistical measure, it doesn’t capture the meaning of words.
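As a concrete sketch, here is one common variant of the score (term frequency multiplied by the log of inverse document frequency), computed by hand for the two example documents. Libraries such as scikit-learn apply extra smoothing and normalization, so treat this as an illustration rather than the canonical formula:

```python
import math
import re
from collections import Counter

docs = [
    "The quick brown fox jumped over the lazy dog.",
    "The dog woke up and started chasing the fox.",
]
tokenized = [re.findall(r"[a-z]+", doc.lower()) for doc in docs]
N = len(tokenized)  # number of documents in the corpus

# Document frequency: in how many documents does each word appear?
df = Counter()
for tokens in tokenized:
    df.update(set(tokens))

def tf_idf(word, tokens):
    tf = tokens.count(word) / len(tokens)  # frequency within this document
    idf = math.log(N / df[word])           # rarer across the corpus -> larger weight
    return tf * idf

# "the" appears in every document, so its score is 0;
# "quick" appears in only one document, so it gets a positive score.
print(tf_idf("the", tokenized[0]))    # 0.0
print(tf_idf("quick", tokenized[0]))  # ~0.077
```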
Although these two techniques help solve many problems in NLP, they still don’t capture the true meaning of words.
The famous linguist J. R. Firth said, “The complete meaning of a word is always contextual, and no study of meaning apart from context can be taken seriously.” With this idea as the foundation, word embeddings gave NLP a much-needed boost. One of the easiest ways to learn word embeddings, or word vectors, is to use neural networks.
Neural-network-based embeddings
Building on the notion that a word can only be completely defined in the context it is used in, word embeddings represent a word by the company it keeps. While building word embeddings, we aim to learn dense vector representations that capture the word’s meaning across the different contexts in which it appears in the documents. In the previous approaches, a single index of the vector indicated only the presence, absence, or count of a word. The following approaches move to a distributional representation, which uses the entire length, i.e. all dimensions, of the vector to express the context and the semantic and syntactic nature in which the word was seen.
Word2Vec
One of the most popular approaches developed to create distributional vectors is Word2Vec, introduced by Mikolov et al. at Google in 2013. They proposed two different architectures: Continuous Bag of Words (CBOW) and Continuous Skip Gram. To understand how these two architectures work, let’s use this example paragraph:
“Harry Potter is a series of seven fantasy novels and eight movies by J. K. Rowling, a British author. It is named for its protagonist and hero, the fictional Harry Potter. The seven books in the series have sold over 500 million copies across the world in over 70 languages, making it the best-selling book series of all time.”
Continuous Bag of Words (CBOW)
This architecture aims to predict the current word based on its input context. Imagine a sliding window that moves from one word to the next in the above paragraph. When the window is centered on the word “fantasy”, the words that precede it and the words that follow it are considered the context of this word.
The one-hot encoded context word vectors are the input to this model, and their number of dimensions is the vocabulary size V of the corpus. The model consists of a single hidden layer and an output layer. The training objective is to maximize the conditional probability of the output word, so the weights W1 and W2 are adjusted until the model assigns a high conditional probability to the output word.
So, in our example, given the one-hot encoded vectors of the four preceding words – “a”, “series”, “of”, “seven” – and the four following words – “novels”, “and”, “eight”, “movies” – the CBOW model shown above is trained to maximize the conditional probability of producing “fantasy” at the output layer. It should be noted that the order in which the context words are fed to the network does not matter.
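To make the architecture concrete, here is a minimal PyTorch sketch of a CBOW-style model. The vocabulary size, embedding dimension, and random indices are placeholders standing in for a real training pipeline, and the embedding lookup plays the same role as multiplying a one-hot vector by W1:

```python
import torch
import torch.nn as nn

class CBOW(nn.Module):
    def __init__(self, vocab_size, embedding_dim):
        super().__init__()
        self.embeddings = nn.Embedding(vocab_size, embedding_dim)  # plays the role of W1
        self.output = nn.Linear(embedding_dim, vocab_size)         # plays the role of W2

    def forward(self, context_ids):
        # context_ids: (batch, 2 * window) indices of the surrounding words
        hidden = self.embeddings(context_ids).mean(dim=1)  # average the context embeddings
        return self.output(hidden)                         # scores over the whole vocabulary

# One training step: maximize P(center word | context) via cross-entropy on the scores
model = CBOW(vocab_size=5000, embedding_dim=100)
loss_fn = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.05)

context = torch.randint(0, 5000, (32, 8))  # 32 examples, 4 context words on each side
center = torch.randint(0, 5000, (32,))     # the word to predict, e.g. "fantasy"

optimizer.zero_grad()
loss = loss_fn(model(context), center)
loss.backward()
optimizer.step()
```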
Continuous Skip Gram
This model is the opposite of the CBOW model: given a word, we want to predict the context in which it is usually seen. Here’s what the architecture looks like:
The model outputs C vectors of V dimensions each, where C is the number of context words we want the model to return and V, as in the previous model, is the total vocabulary size. The model is trained to minimize the summed prediction error over these C outputs. So, when the input to the model is “fantasy”, we expect it to return vectors for the words that “fantasy” is usually found alongside. It should be noted that the model gives better word vectors as C increases.
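As another hedged sketch, with the same placeholder vocabulary size and random indices as before, the skip-gram variant simply swaps the roles of the center word and its context and sums the prediction error over the C context positions:

```python
import torch
import torch.nn as nn

class SkipGram(nn.Module):
    def __init__(self, vocab_size, embedding_dim):
        super().__init__()
        self.embeddings = nn.Embedding(vocab_size, embedding_dim)  # W1: center-word vectors
        self.output = nn.Linear(embedding_dim, vocab_size)         # W2: scores for context words

    def forward(self, center_ids):
        # center_ids: (batch,) index of the center word, e.g. "fantasy"
        return self.output(self.embeddings(center_ids))  # (batch, vocab) scores

model = SkipGram(vocab_size=5000, embedding_dim=100)
loss_fn = nn.CrossEntropyLoss()

center = torch.randint(0, 5000, (32,))
context = torch.randint(0, 5000, (32, 8))  # C = 8 context words per example

scores = model(center)  # the same scores are compared against every context position
loss = sum(loss_fn(scores, context[:, j]) for j in range(context.size(1)))
loss.backward()
```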
CBOW is simpler and faster to train, but the Continuous Skip Gram model performs better on infrequent words. Even though we train these two models, we don’t use them directly for prediction. What we are interested in are the weights the models end up with: the learned weight matrix, typically the input matrix W1 (though W2, or a combination of the two, can also be used), becomes our word vectors. These word vectors can then be used to initialize neural networks trained to perform different tasks.
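In practice you rarely implement these models by hand; libraries such as Gensim expose both architectures behind one class. Here is a minimal sketch (parameter names follow Gensim 4.x, and the tiny tokenized corpus is just a placeholder):

```python
from gensim.models import Word2Vec

# Placeholder corpus: a list of tokenized sentences
sentences = [
    ["harry", "potter", "is", "a", "series", "of", "fantasy", "novels"],
    ["the", "books", "have", "been", "made", "into", "movies"],
]

# sg=0 selects the CBOW architecture, sg=1 selects skip-gram
model = Word2Vec(sentences, vector_size=100, window=4, min_count=1, sg=1)

vector = model.wv["fantasy"]          # the learned word vector
print(vector.shape)                   # (100,)
print(model.wv.most_similar("fantasy", topn=3))
```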
Word2Vec wasn’t the first model to come up with an architecture for generating continuous distributional word vectors, but it was the first to substantially reduce the computational complexity that comes with such a task.
If you are interested in learning about the Word2Vec models in depth, here are the papers that introduced them to the world:
- Efficient Estimation of Word Representations in Vector Space – Mikolov et al. 2013
This paper talks about the architectures discussed above – the CBOW and Skip Gram models.
- Distributed Representations of Words and Phrases and their Compositionality – Mikolov et al. 2013
This paper shows the power of word vectors and the different techniques that can be used to optimize them, such as negative sampling and hierarchical softmax.
- Linguistic Regularities in Continuous Space Word Representations – Mikolov et al. 2013
Reading this paper will show you the reasoning capabilities of these vectors and the ways they can be used.
This is also where the famous example “King – Man + Woman = Queen” is introduced.
The quality of the word vectors depends on the corpus used to train them. Word2Vec was initially trained on a Google News dataset of about 1.6 billion words. The word vectors therefore reflect the contexts in which words are used in news reporting.
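As a hedged example, the pretrained Google News vectors can be loaded with Gensim and queried for the “King – Man + Woman = Queen” analogy mentioned above. The file name below is the one the vectors are commonly distributed under, so adjust the path to wherever your copy lives:

```python
from gensim.models import KeyedVectors

# Pretrained Google News vectors (commonly distributed under this file name)
vectors = KeyedVectors.load_word2vec_format(
    "GoogleNews-vectors-negative300.bin", binary=True
)

# King - Man + Woman: "queen" is expected to rank at or near the top
print(vectors.most_similar(positive=["king", "woman"], negative=["man"], topn=1))
```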
In the next post, we’ll discuss other ways to generate word vectors using GloVe and FastText. We’ll also take a look at the latest language representations built using models like ELMo and BERT.