Generate Word Embedding With Word2Vec

In the previous session, we saw the power of word embeddings.

Now we will see how to generate word embedding using Word2Vec!

Word2Vec is the most popular word embedding technique: we learn the embeddings by training a simple two-layer neural network.

Concept

The concept is pretty simple if we delve into the intuition first:

  1. Our input starts as a one-hot encoded vector. We’ve seen it before: a vector in which exactly one element is 1 and the rest are 0s. This vector represents our input word.

  2. This one-hot encoded vector is fully connected to a hidden layer, where each neuron captures a different aspect of context learned during training. One neuron might focus on the tense of verbs, while another might focus on gender differences in pronouns, and so on.

  3. The hidden layer is then fully connected to an output layer which uses softmax to produce probabilities for every word in the vocab.

This last step is crucial to understanding how Word2Vec learns relations between words.
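To make the architecture concrete, here is a minimal sketch in PyTorch. The vocabulary size and hidden-layer size are made up for illustration, and real Word2Vec implementations avoid the full softmax with tricks such as negative sampling:

import torch
import torch.nn as nn

vocab_size = 10000     # hypothetical vocabulary size
embedding_dim = 300    # hypothetical hidden-layer (embedding) size

# one-hot input -> hidden layer -> softmax over the whole vocabulary
hidden = nn.Linear(vocab_size, embedding_dim, bias=False)
output = nn.Linear(embedding_dim, vocab_size, bias=False)

one_hot = torch.zeros(1, vocab_size)
one_hot[0, 42] = 1.0   # one-hot vector for an arbitrary word id

probs = torch.softmax(output(hidden(one_hot)), dim=1)
print(probs.shape)     # torch.Size([1, 10000]): one probability per word in the vocab

# After training, each column of hidden.weight is kept as the embedding of one word.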

The Intuition

If we read a lot of sentences like:

  • “the king ordered the citizens to leave the city”
  • “the ruler ordered the citizens to leave the city”
  • “the king commanded the citizens to evacuate the city”
  • “the ruler commanded the citizens to evacuate the city”

We can see that the words “king” and “ruler” are used in similar contexts, which means “king” should be close to “ruler” in the vector space.

“commanded” and “ordered” are also used in similar contexts, so they should be close to each other as well.

Let’s try other examples:

  • “fish swims in the water”
  • “the water is home to many fish”
  • “the fish is dead because of the polluted water”
  • “water is essential for fish to live”

We see that “fish” and “water” often appear together. Shall we conclude that “fish” and “water” are close to each other in the vector space?

Yes, we can!
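What does “close in the vector space” actually mean? A common measure is cosine similarity. Here is a tiny sketch with made-up 3-dimensional vectors (real Word2Vec vectors typically have hundreds of dimensions):

import numpy as np

def cosine_similarity(a, b):
    # cosine of the angle between two vectors: 1 means same direction, -1 means opposite
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# made-up vectors, purely for illustration
fish = np.array([0.9, 0.1, 0.3])
water = np.array([0.8, 0.2, 0.4])
desert = np.array([-0.7, 0.6, 0.1])

print(cosine_similarity(fish, water))   # high: "fish" and "water" appear in similar contexts
print(cosine_similarity(fish, desert))  # low: very different contexts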

So, how can we use this information to build a word embedding?

Relating words

Word2Vec learns the context of a word from co-occurrence: when a word is commonly found near another word, the two words have a close relationship. For example, the words “fish” and “water” are often found together, so they have a close relationship.

Let’s use the following sentence as an example:

\[\text{"I love to eat fish, but I hate to drink water"}\]

There are two ways to create this relation:

Continuous Skip-gram

Predicts words within a certain range before and after a word in a sentence.

So, given “eat” and window size = 2, the skip-gram model will predict “love”, “to”, “fish”, and “but” as the output words. (see 2nd row below)

| Window size | Text | Input | Predicted |
|---|---|---|---|
| 2 | [I _love_ to eat] fish, but I hate to drink water | love | (I, to, eat) |
| 2 | I [love to _eat_ fish, but] I hate to drink water | eat | (love, to, fish, but) |
| 2 | I love [to eat _fish_, but I] hate to drink water | fish | (to, eat, but, I) |
| 3 | I [love to eat _fish_, but I hate] to drink water | fish | (love, to, eat, but, I, hate) |
def get_skip_gram_pairs(sentence, window_size):
    # for every word, collect the context words within window_size positions on each side
    skip_gram_pairs = []
    words = sentence.lower().split()
    for i in range(len(words)):
        predicted = []
        for j in range(i - window_size, i + window_size + 1):
            if j == i or j < 0 or j >= len(words):
                continue  # skip the input word itself and positions outside the sentence
            predicted.append(words[j])
        skip_gram_pairs.append([words[i], predicted])
    return skip_gram_pairs

get_skip_gram_pairs("I love to eat fish, but I hate to drink water", 2)
[['i', ['love', 'to']],
 ['love', ['i', 'to', 'eat']],
 ['to', ['i', 'love', 'eat', 'fish,']],
 ['eat', ['love', 'to', 'fish,', 'but']],
 ['fish,', ['to', 'eat', 'but', 'i']],
 ['but', ['eat', 'fish,', 'i', 'hate']],
 ['i', ['fish,', 'but', 'hate', 'to']],
 ['hate', ['but', 'i', 'to', 'drink']],
 ['to', ['i', 'hate', 'drink', 'water']],
 ['drink', ['hate', 'to', 'water']],
 ['water', ['to', 'drink']]]

(Figure: skip-gram)

Continuous Bag of Words (CBOW)

It’s quite the opposite of Skip-gram. It predicts a middle word given the context of a few words before and a few words after the target word.

So, given “love”, “to”, “fish”, and “but”, the CBOW model will predict “eat” as the output word.

| Window size | Text | Input | Predicted |
|---|---|---|---|
| 2 | [I _love_ to eat] fish, but I hate to drink water | (“I”, “to eat”) | love |
| 2 | I [love to _eat_ fish, but] I hate to drink water | (“love to”, “fish, but”) | eat |
| 2 | I love [to eat _fish_, but I] hate to drink water | (“to eat”, “but I”) | fish |
| 3 | I [love to eat _fish_, but I hate] to drink water | (“love to eat”, “but I hate”) | fish |
def generate_cbow(sentence, window_size):
    # for every position with a full context window on both sides,
    # pair the surrounding context words with the middle (target) word
    words = sentence.split()
    cbow_pairs = []
    for i in range(window_size, len(words) - window_size):
        context_words = []
        for j in range(i - window_size, i + window_size + 1):
            if j == i or j < 0 or j >= len(words):
                continue  # skip the target word itself
            context_words.append(words[j])
        cbow_pairs.append((context_words, words[i]))
    return cbow_pairs

generate_cbow("I love to eat fish, but I hate to drink water", 2)
[(['I', 'love', 'eat', 'fish,'], 'to'),
 (['love', 'to', 'fish,', 'but'], 'eat'),
 (['to', 'eat', 'but', 'I'], 'fish,'),
 (['eat', 'fish,', 'I', 'hate'], 'but'),
 (['fish,', 'but', 'hate', 'to'], 'I'),
 (['but', 'I', 'to', 'drink'], 'hate'),
 (['I', 'hate', 'drink', 'water'], 'to')]

(Figure: CBOW)

Let’s build a Word2Vec model!

The easiest way

The easiest way to build a Word2Vec model is to use the gensim library.

from gensim.models import Word2Vec

sentences = [['I', 'love', 'to', 'eat', 'ice', 'cream'],
             ['The', 'ice', 'cream', 'is', 'delicious'],
             ['Ice', 'cream', 'is', 'my', 'favorite'],
             ['Ice', 'is', 'very', 'cold'],
             ['South', 'Africa', 'is', 'the', 'house', 'of', 'various', 'animals'],
             ['The', 'desert', 'is', 'very', 'hot']]
model = Word2Vec(sentences, min_count=1, vector_size=100, window=3)

We now have the Word2Vec model; yes, it’s that simple.
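By default, gensim trains with CBOW; if you want skip-gram instead, the sg parameter selects the training algorithm (sg=1 for skip-gram, sg=0 for CBOW). A minimal sketch, with the variable name chosen just for illustration:

skip_gram_model = Word2Vec(sentences, min_count=1, vector_size=100, window=3, sg=1)  # sg=1 selects skip-gram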

Let’s see how to use it!

Let’s print the vector of the word “ice”

model.wv['ice']
array([ 9.3212293e-05,  3.0770719e-03, -6.8118651e-03, -1.3751196e-03,
        7.6688202e-03,  7.3457859e-03, -3.6729246e-03,  2.6427959e-03,
       -8.3167218e-03,  6.2047895e-03, -4.6373853e-03, -3.1648867e-03,
        9.3104383e-03,  8.7417278e-04,  7.4903117e-03, -6.0727512e-03,
        5.1610614e-03,  9.9233752e-03, -8.4570879e-03, -5.1350184e-03,
       -7.0640068e-03, -4.8623742e-03, -3.7776425e-03, -8.5362354e-03,
        7.9563707e-03, -4.8429691e-03,  8.4230686e-03,  5.2623590e-03,
       -6.5500555e-03,  3.9582876e-03,  5.4709758e-03, -7.4261688e-03,
       -7.4054217e-03, -2.4756726e-03, -8.6249216e-03, -1.5812701e-03,
       -4.0279646e-04,  3.3000994e-03,  1.4431456e-03, -8.8017591e-04,
       -5.5925641e-03,  1.7296794e-03, -8.9829665e-04,  6.7929067e-03,
        3.9731935e-03,  4.5301151e-03,  1.4342893e-03, -2.6998674e-03,
       -4.3663131e-03, -1.0323119e-03,  1.4375548e-03, -2.6464923e-03,
       -7.0737889e-03, -7.8048133e-03, -9.1210250e-03, -5.9340443e-03,
       -1.8465136e-03, -4.3233316e-03, -6.4603114e-03, -3.7162432e-03,
        4.2880280e-03, -3.7385889e-03,  8.3772345e-03,  1.5335697e-03,
       -7.2412803e-03,  9.4334288e-03,  7.6311510e-03,  5.4920013e-03,
       -6.8473201e-03,  5.8228681e-03,  4.0087220e-03,  5.1837498e-03,
        4.2560440e-03,  1.9400261e-03, -3.1702011e-03,  8.3524166e-03,
        9.6113142e-03,  3.7917446e-03, -2.8362276e-03,  7.0220985e-06,
        1.2179716e-03, -8.4580434e-03, -8.2226843e-03, -2.3149964e-04,
        1.2369631e-03, -5.7432777e-03, -4.7246884e-03, -7.3462939e-03,
        8.3279610e-03,  1.2049330e-04, -4.5093168e-03,  5.7014343e-03,
        9.1802459e-03, -4.1006147e-03,  7.9636248e-03,  5.3757117e-03,
        5.8797505e-03,  5.1249505e-04,  8.2120160e-03, -7.0181224e-03],
      dtype=float32)
model.wv.most_similar('ice')
[('very', 0.13172131776809692),
 ('delicious', 0.07499014586210251),
 ('cream', 0.06798356026411057),
 ('favorite', 0.04159315675497055),
 ('to', 0.04135243594646454),
 ('eat', 0.012988785281777382),
 ('I', 0.0066059790551662445),
 ('love', -0.009281391277909279),
 ('Ice', -0.013502932153642178),
 ('my', -0.013687963597476482)]
model.wv.most_similar('Africa')
[('is', 0.21887390315532684),
 ('my', 0.17480239272117615),
 ('hot', 0.16380424797534943),
 ('very', 0.10851778090000153),
 ('various', 0.10759598016738892),
 ('South', 0.06559502333402634),
 ('house', 0.059589654207229614),
 ('cold', 0.0490604005753994),
 ('Ice', 0.04764048010110855),
 ('cream', 0.02233739383518696)]
import urllib.request
urllib.request.urlretrieve("https://gist.githubusercontent.com/phillipj/4944029/raw/75ba2243dd5ec2875f629bf5d79f6c1e4b5a8b46/alice_in_wonderland.txt", "alice.txt")

sentences = []
with open('alice.txt', 'r') as f:
    sentences = f.readlines()
    sentences = [sentence.strip() for sentence in sentences]
    sentences = [sentence for sentence in sentences if sentence != '']
    sentences = [sentence.split() for sentence in sentences]

    # remove punctuation
    sentences = [[word for word in sentence if word.isalpha()] for sentence in sentences]
    # lower case
    sentences = [[word.lower() for word in sentence] for sentence in sentences]

sentences[:8]
[['adventures', 'in', 'wonderland'],
 ['adventures', 'in', 'wonderland'],
 ['lewis', 'carroll'],
 ['the', 'millennium', 'fulcrum', 'edition'],
 ['chapter', 'i'],
 ['down', 'the'],
 ['alice',
  'was',
  'beginning',
  'to',
  'get',
  'very',
  'tired',
  'of',
  'sitting',
  'by',
  'her',
  'sister'],
 ['on',
  'the',
  'and',
  'of',
  'having',
  'nothing',
  'to',
  'once',
  'or',
  'twice',
  'she',
  'had']]
# train a Word2Vec model on the Alice in Wonderland sentences

model = Word2Vec(sentences, min_count=1, vector_size=100, window=5)

model.wv.most_similar('dark')
[('creatures', 0.8895953297615051),
 ('open', 0.8853762149810791),
 ('william', 0.8853062987327576),
 ('gave', 0.8836493492126465),
 ('extraordinary', 0.8834893703460693),
 ('shook', 0.8826101422309875),
 ('until', 0.8825846910476685),
 ('puzzled', 0.8824530839920044),
 ('half', 0.8820216059684753),
 ('whether', 0.8819820284843445)]
model.wv.most_similar("animal")
[('sit', 0.9368554353713989),
 ('yourself', 0.935146152973175),
 ('every', 0.9333352446556091),
 ('waving', 0.9328792691230774),
 ('walked', 0.9322237372398376),
 ('too', 0.9318441152572632),
 ('hatter', 0.9318233132362366),
 ('hands', 0.9317442178726196),
 ('right', 0.9315378069877625),
 ('go', 0.9311593174934387)]
model.wv.most_similar("book")
[('cook', 0.97368985414505),
 ('keep', 0.9732257723808289),
 ('voice', 0.9731258153915405),
 ('found', 0.9730350375175476),
 ('three', 0.9728507399559021),
 ('made', 0.9726716876029968),
 ('him', 0.9724807739257812),
 ('seen', 0.9724376201629639),
 ('tell', 0.9724341034889221),
 ('rabbit', 0.9724060893058777)]
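As a quick sanity check on the intuition from earlier (words used in similar contexts should end up close together), we can compare two words directly with wv.similarity. A small sketch, assuming the words appear in the processed text (they should, for Alice in Wonderland); exact scores will vary between runs:

# similarity between two learned vectors (higher means closer in the vector space)
print(model.wv.similarity('king', 'queen'))
print(model.wv.similarity('alice', 'rabbit'))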

Our own CBOW model

# Use PyTorch to train a Word2Vec model using CBOW

import torch
import torch.nn as nn

class CBOW(nn.Module):
    def __init__(self, vocab_size, embedding_dim, context_size):
        super(CBOW, self).__init__()
        self.embeddings = nn.Embedding(vocab_size, embedding_dim)
        # the context has context_size words on each side, hence context_size * 2 embeddings
        self.linear1 = nn.Linear(context_size * 2 * embedding_dim, 128)
        self.linear2 = nn.Linear(128, vocab_size)

    def forward(self, inputs):
        # look up the context embeddings and concatenate them into a single vector
        embeds = self.embeddings(inputs).view((1, -1))
        out = torch.relu(self.linear1(embeds))
        out = self.linear2(out)
        # log-probabilities over the vocabulary (paired with NLLLoss during training)
        log_probs = torch.log_softmax(out, dim=1)
        return log_probs

import torch.optim as optim

losses = []
loss_function = nn.NLLLoss()

sentences = ['I love to eat ice cream',
    'The ice cream is delicious',
    'Ice cream is my favorite',
    'Ice is very cold',
    'South Africa is the house of various animals',
    'The desert is very hot']


vocab = set()
for sentence in sentences:
    for word in sentence.split():
        vocab.add(word)

word_to_idx = {word: i for i, word in enumerate(vocab)}
idx_to_word = {i: word for i, word in enumerate(vocab)}

print("word_to_idx", word_to_idx)
print("idx_to_word", idx_to_word)

context_size = 2
embedding_dim = 10
vocab_size = len(vocab)

model = CBOW(vocab_size, embedding_dim, context_size)
optimizer = optim.SGD(model.parameters(), lr=0.001)

for epoch in range(100):
    total_loss = 0

    for sentence in sentences:
        cbows = generate_cbow(sentence, context_size)
        for cbow in cbows:
            context, target = cbow
            # print("context", context, "target", target)
            context_idxs = torch.tensor([word_to_idx[w] for w in context], dtype=torch.long)

            model.zero_grad()
            log_probs = model(context_idxs)
            loss = loss_function(log_probs, torch.tensor([word_to_idx[target]], dtype=torch.long))
            loss.backward()
            optimizer.step()

            total_loss += loss.item()
    
    losses.append(total_loss)
    if epoch % 10 == 0:
        print("Epoch: {}, Loss: {:.4f}".format(epoch + 1, total_loss))


# Predict
def predict(context):
    print("context:", context)
    context_idxs = torch.tensor([word_to_idx[w] for w in context], dtype=torch.long)
    log_probs = model(context_idxs)
    _, predicted = torch.max(log_probs, 1)
    print("predicted:", idx_to_word[predicted.item()])

predict(['I', 'love', 'eat', 'ice'])
predict(['The', 'ice', 'cream', 'delicious'])
word_to_idx {'of': 0, 'the': 1, 'love': 2, 'eat': 3, 'is': 4, 'favorite': 5, 'ice': 6, 'cream': 7, 'South': 8, 'my': 9, 'Ice': 10, 'cold': 11, 'various': 12, 'I': 13, 'animals': 14, 'desert': 15, 'The': 16, 'hot': 17, 'very': 18, 'delicious': 19, 'Africa': 20, 'house': 21, 'to': 22}
idx_to_word {0: 'of', 1: 'the', 2: 'love', 3: 'eat', 4: 'is', 5: 'favorite', 6: 'ice', 7: 'cream', 8: 'South', 9: 'my', 10: 'Ice', 11: 'cold', 12: 'various', 13: 'I', 14: 'animals', 15: 'desert', 16: 'The', 17: 'hot', 18: 'very', 19: 'delicious', 20: 'Africa', 21: 'house', 22: 'to'}
Epoch: 1, Loss: 29.1734
Epoch: 11, Loss: 26.6193
Epoch: 21, Loss: 24.1955
Epoch: 31, Loss: 21.8820
Epoch: 41, Loss: 19.6662
Epoch: 51, Loss: 17.5518
Epoch: 61, Loss: 15.5483
Epoch: 71, Loss: 13.6769
Epoch: 81, Loss: 11.9517
Epoch: 91, Loss: 10.3868
context: ['I', 'love', 'eat', 'ice']
predicted: to
context: ['The', 'ice', 'cream', 'delicious']
predicted: cream
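The second prediction is off, which is expected with such a tiny corpus, but the learned embeddings themselves live in the nn.Embedding layer of our CBOW model. Here is a minimal sketch of how we could read them out, using the model and word_to_idx defined above (the 'ice'/'cream' pair is just an example):

import torch.nn.functional as F

# the learned word vectors are the rows of the embedding weight matrix
ice_vector = model.embeddings.weight[word_to_idx['ice']].detach()
cream_vector = model.embeddings.weight[word_to_idx['cream']].detach()
print(ice_vector.shape)  # torch.Size([10]) because embedding_dim = 10

# cosine similarity between the two learned vectors (higher means closer)
print(F.cosine_similarity(ice_vector, cream_vector, dim=0).item())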