Let’s step back a bit to understand the intuition behind word embeddings.

Recall the question we asked in the previous section: how can we represent a word in a way that a computer can understand?

ASCII

If you know a bit about computer science, you might have heard of ASCII. ASCII is a character encoding standard for electronic communication. It assigns a number to each character (letters, digits, punctuation, and control characters), giving computers a way to store and manipulate text.

import pandas as pd

# build the rows first, then create the DataFrame
# (DataFrame.append was removed in pandas 2.0)
rows = [{'ascii': i, 'char': chr(i)} for i in range(32, 127)]
df = pd.DataFrame(rows)
df.head(50)
ascii char
0 32
1 33 !
2 34 "
3 35 #
4 36 $
5 37 %
6 38 &
7 39 '
8 40 (
9 41 )
10 42 *
11 43 +
12 44 ,
13 45 -
14 46 .
15 47 /
16 48 0
17 49 1
18 50 2
19 51 3
20 52 4
21 53 5
22 54 6
23 55 7
24 56 8
25 57 9
26 58 :
27 59 ;
28 60 <
29 61 =
30 62 >
31 63 ?
32 64 @
33 65 A
34 66 B
35 67 C
36 68 D
37 69 E
38 70 F
39 71 G
40 72 H
41 73 I
42 74 J
43 75 K
44 76 L
45 77 M
46 78 N
47 79 O
48 80 P
49 81 Q
# convert "Hello" to ascii array
ascii_array = [ord(c) for c in "Hello"]
ascii_array
[72, 101, 108, 108, 111]

So, “Hello” is stored as the vector \[[72, 101, 108, 108, 111]\] in ASCII.
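Going the other way is just as easy: chr converts an ASCII code back into its character.

# convert the ascii array back to a string
''.join(chr(c) for c in ascii_array)
'Hello'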

Problems with ASCII for NLP

ASCII is a very simple way to represent text. However, it has a few problems:

  • It does not capture the meaning of words. For example, “good” and “great” are very similar in meaning, but their ASCII codes are just character numbers that say nothing about that similarity.

  • It does not capture the relationship between words. For example, “good” and “bad” are closely related as antonyms, yet nothing in their ASCII representations hints at that relationship (see the quick check below).
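As a quick illustration (the exact numbers don’t matter), compare the ASCII arrays of a few words: the values reflect spelling, not meaning.

# ASCII encodes characters, so words with similar meanings
# do not get similar numbers
for word in ["good", "great", "bad"]:
    print(word, [ord(c) for c in word])
good [103, 111, 111, 100]
great [103, 114, 101, 97, 116]
bad [98, 97, 100]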

One hot encoding

Another traditional way to represent words as vectors is one hot encoding, and we’ll see how word embeddings improve on this approach. A one hot encoding is a vector that is 0 in every dimension except for a single dimension, where it is 1. The dimensionality of the vector is the same as the number of words in the vocabulary.

Let’s see an example. Say we have the following sentence:

\[ \text{sentence} = \text{"the cat sat on the mat"} \]

The vocabulary of this sentence is:

\[ \text{vocabulary} = \{ \text{"the"}, \text{"cat"}, \text{"sat"}, \text{"on"}, \text{"mat"} \} \] (notice that “the” appears twice in the sentence, but it is only counted once in the vocabulary)

To represent each word, we use a vector of length 5 (the size of the vocabulary): we set the element at the word’s index in the vocabulary to 1 and all the other elements to 0.

\[ \text{"cat"} = [0, 1, 0, 0, 0] \]

\[ \text{"mat"} = [0, 0, 0, 0, 1] \]

\[ \text{"the"} = [1, 0, 0, 0, 0] \]

As you can see, the concept of one hot encoding is very simple: A word would only be represented by a 1 in the position of its index in the vocabulary and 0 in all other positions.

One hot encoding in Python

Let’s try to implement this in Python. First we build a dictionary that maps each word in the vocabulary to its one hot vector; later we’ll use the split function to break a sentence into words.

import numpy as np

def create_encoding_dict(vocabs: set) -> dict:
    # map each word to a one hot vector of length len(vocabs)
    # (note: a Python set has no guaranteed order, so the index assigned
    #  to each word can differ between runs)
    result = {}
    for i, word in enumerate(vocabs):
        result[word] = np.zeros(len(vocabs))
        result[word][i] = 1

    return result

create_encoding_dict({'hello', 'world', 'good', 'morning'})
{'morning': array([1., 0., 0., 0.]),
 'world': array([0., 1., 0., 0.]),
 'hello': array([0., 0., 1., 0.]),
 'good': array([0., 0., 0., 1.])}

That’s it, now we have the mapping dictionary. Let’s use it to encode the sentence.

Encode a sentence

Now we know how to build a one hot encoding for a given vocabulary.

That encoding can be used to encode a sentence.

encoding_dict = create_encoding_dict({'hello', 'world', 'good', 'morning', 'afternoon', 'night'})
encoding_dict
{'night': array([1., 0., 0., 0., 0., 0.]),
 'hello': array([0., 1., 0., 0., 0., 0.]),
 'good': array([0., 0., 1., 0., 0., 0.]),
 'afternoon': array([0., 0., 0., 1., 0., 0.]),
 'world': array([0., 0., 0., 0., 1., 0.]),
 'morning': array([0., 0., 0., 0., 0., 1.])}
def encode_sentence(sentence: str, encoding_dict: dict) -> list:
    result = []
    for word in sentence.split():
        result.append(encoding_dict[word])
    return result

encoded_sentence = encode_sentence('hello world', encoding_dict)
encoded_sentence
[array([0., 1., 0., 0., 0., 0.]), array([0., 0., 0., 0., 1., 0.])]

So \(\text{"hello world"}\) is encoded as matrix

\[ \begin{bmatrix} 0 & 1 & 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 & 1 & 0 \\ \end{bmatrix} \]

encode_sentence("hello good morning", encoding_dict)
[array([0., 1., 0., 0., 0., 0.]),
 array([0., 0., 1., 0., 0., 0.]),
 array([0., 0., 0., 0., 0., 1.])]

So \(\text{"hello good morning"}\) is encoded as matrix

\[ \begin{bmatrix} 0 & 1 & 0 & 0 & 0 & 0 \\ 0 & 0 & 1 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 & 0 & 1 \\ \end{bmatrix} \]

Decode a sentence

But how do we decode a sentence?

Easy, we just need to reverse the encoding.

def decode_sentence(sentence: list, encoding_dict: dict) -> str:
    result = []
    for word in sentence:
        # np.argmax finds the position of the 1; since dicts preserve
        # insertion order, that position maps back to the original word
        result.append(list(encoding_dict.keys())[np.argmax(word)])
    return ' '.join(result)

encoded_sentence = encode_sentence("hello good morning", encoding_dict)
decode_sentence(encoded_sentence, encoding_dict)
'hello good morning'

The problem of one hot encoding

One hot encoding is a very simple way to encode words. However, it has a few problems:

  1. The dimensionality of the vectors is the same as the number of words in the vocabulary.

    If the vocabulary contains 10,000 words, then each word is represented by a vector of length 10,000: to encode “hello” we need a vector of 10,000 elements, and only one of them is 1.

    \[ encode("hello") = [0, 0, 0, 0, 0, 0, 0, 0, 0, 1, ..., 0, 0, 0, 0, 0] \]

    You can also see how “wasteful” this is. Most of the elements in the vector are zeros, which is the point of the next problem.

  2. The vectors are very sparse: Most of the elements in the vector are zeros.

    We can see from the example above that only one element is 1 and the rest are 0. This is very wasteful, and it is also hard to train a model on such sparse vectors because, as we’ve learned, they suffer from the curse of dimensionality (see the small sketch after this list).

  3. The vectors are independent: there is no relationship between the vectors.

    \[ encode("king") = [0, 0, 0, 0, 0, 0, 0, 0, 0, 1, ..., 0, 0, 0, 0, 0] \\ encode("raja") = [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ..., 0, 0, 0, 0, 1] \]

    Using basic one hot encoding, we can’t tell that “king” and “ruler” are similar, and so when we train our model on a sentence such as:

    “The king conquered the land”

    We can’t use that learning to complete the sentence:

    “The ruler conquered the …”

    because the model doesn’t know that “king” and “ruler” are similar.
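To get a feel for problems 1 and 2, here is a minimal sketch, assuming a hypothetical vocabulary of 10,000 words and a 300-dimensional dense embedding (both numbers are purely illustrative):

import numpy as np

# one hot: a 10,000-word vocabulary means a 10,000-element vector per word,
# with a single 1 and 9,999 zeros
one_hot = np.zeros(10_000)
one_hot[9] = 1

# a dense embedding (e.g. 300 dimensions, like GloVe 300d) is far smaller
dense = np.random.rand(300)

print(one_hot.nbytes)             # 80000 bytes for a single word
print(dense.nbytes)               # 2400 bytes for a single word
print(int((one_hot == 0).sum()))  # 9999 of the 10000 entries are zero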

Word Embedding Superiority

We’ve learned about word embedding before, but let’s try to understand why it’s superior to one hot encoding.

  1. The vectors are smaller, a lot smaller.

  2. The vectors are dense.

  3. The vectors are related: there is a relationship between the vectors.

Word embeddings have the advantage of capturing semantic relationships between words. In a trained word embedding model, words that are semantically related have similar representations—i.e., their vectors are closer to each other in the vector space. This allows the model to capture patterns like:

  • Synonyms (words with similar meanings have similar vectors)
  • Analogies (relationships like ‘king’ to ‘queen’ as ‘man’ to ‘woman’)
  • Thematic relationships (e.g., ‘Paris’, ‘Rome’ and ‘Berlin’ might all be considered similar, reflecting their shared characteristic as European capitals).
  • And more

Vector

Before we dive into word embedding, let’s first understand what a vector is.

A vector is a mathematical object that has both a magnitude and a direction. For example, the velocity of a moving object is a vector: it has a magnitude (the speed of the object) and a direction (the direction in which the object is moving).

# Draw a vector of (2, 5)

import matplotlib.pyplot as plt
import numpy as np

plt.quiver(0, 0, 2, 5, angles='xy', scale_units='xy', scale=1)
plt.xlim(-10, 10)
plt.ylim(-10, 10)

plt.text(1, 5, r'$\vec{v}$', size=18)
plt.show()

Remember, it has both a magnitude and a direction.

Let’s draw another vector.

# Draw two vectors with different magnitudes and directions

plt.quiver(0, 0, 2, 5, angles='xy', scale_units='xy', scale=1, color='blue')
plt.text(1, 5, r'$\vec{v}$', size=18, color='blue')
plt.quiver(0, 0, 3, -5, angles='xy', scale_units='xy', scale=1, color='red')
plt.text(3, -6, r'$\vec{w}$', size=18, color='red')

plt.xlim(-10, 10)
plt.ylim(-10, 10)

plt.show()

Mathematically, a vector can be represented as a list of numbers (written in vector notation).

Here are those two vectors in vector notation:

\[ \vec{v} = \begin{bmatrix} 2 \\ 5 \end{bmatrix} \] \[ \vec{w} = \begin{bmatrix} 3 \\ -5 \end{bmatrix} \]

Those vectors have different magnitudes and directions.

The first element of the vector is its component in the x direction, and the second element is its component in the y direction.

Let’s break down the vector \(\vec{v}\).

import matplotlib.pyplot as plt

plt.quiver(0, 0, 2, 5, angles='xy', scale_units='xy', scale=1, color='blue')
plt.text(1, 5, r'$\vec{v}$', size=18, color='blue')

# break it down into two vectors: x and y
plt.quiver(0, 0, 2, 0, angles='xy', scale_units='xy', scale=1, color='red')
plt.text(1, -1, r'$\vec{x}$', size=18, color='red')
plt.quiver(2, 0, 0, 5, angles='xy', scale_units='xy', scale=1, color='green')
plt.text(2.5, 2.5, r'$\vec{y}$', size=18, color='green')

plt.xlim(-10, 10)
plt.ylim(-10, 10)

plt.show()

So

\[ \vec{v} = \begin{bmatrix} 2 \\ 5 \end{bmatrix} \]

is actually the sum of two vectors:

\[ \vec{x} = \begin{bmatrix} 2 \\ 0 \end{bmatrix} \]

and

\[ \vec{y} = \begin{bmatrix} 0 \\ 5 \end{bmatrix} \]

Vector addition

So to add two vectors, we just need to add the elements of the vectors.

\[ \vec{v} = \begin{bmatrix} 2 \\ 5 \end{bmatrix} = \begin{bmatrix} 2 \\ 0 \end{bmatrix} + \begin{bmatrix} 0 \\ 5 \end{bmatrix} = \vec{x} + \vec{y} \]

Let’s try another example, adding our two vectors \(\vec{v}\) and \(\vec{w}\):

\[ \vec{v} + \vec{w} = \begin{bmatrix} 2 \\ 5 \end{bmatrix} + \begin{bmatrix} 3 \\ -5 \end{bmatrix} = \begin{bmatrix} 5 \\ 0 \end{bmatrix} \]
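The same addition in numpy, just to confirm:

import numpy as np

np.array([2, 5]) + np.array([3, -5])   # array([5, 0]), addition is element-wise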

## Draw two vectors and their sum

import matplotlib.pyplot as plt

plt.quiver(0, 0, 2, 5, angles='xy', scale_units='xy', scale=1, color='blue')
plt.text(1, 5, r'$\vec{v}$', size=18, color='blue')

plt.quiver(0, 0, 3, -5, angles='xy', scale_units='xy', scale=1, color='orange')
plt.text(3, -6, r'$\vec{w}$', size=18, color='orange')

# the sum (5, 0) is drawn as its own vector
plt.quiver(0, 0, 5, 0, angles='xy', scale_units='xy', scale=1, color='green')
plt.text(2.5, 0.5, r'$\vec{v} + \vec{w}$', size=18, color='green')

plt.xlim(-10, 10)
plt.ylim(-10, 10)

plt.show()

Vector magnitude

Recall that vector has both a magnitude and a direction.

So, how do we calculate the magnitude of a vector?

The magnitude of a vector is the length of the vector.

import matplotlib.pyplot as plt

plt.quiver(0, 0, 2, 5, angles='xy', scale_units='xy', scale=1)
plt.text(1, 5, r'$\vec{v}$', size=18)

plt.xlim(-10, 10)
plt.ylim(-10, 10)

plt.show()

How to calculate that length?

Remember the Pythagorean theorem? :)

We just need to break the vector down into its components, and then use the Pythagorean theorem to calculate the length.

\[ \vec{v} = \begin{bmatrix} 2 \\ 5 \end{bmatrix} \]

The magnitude is \[ \sqrt{2^2 + 5^2} = \sqrt{4 + 25} = \sqrt{29} \]

We can write the magnitude using the \(||\vec{v}||\) notation:

\[ ||\vec{v}|| = \sqrt{2^2 + 5^2} = \sqrt{4 + 25} = \sqrt{29} \]
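numpy can compute this length for us with np.linalg.norm:

import numpy as np

np.linalg.norm(np.array([2, 5]))   # sqrt(29) ≈ 5.385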

Vector dot product

Vector dot product is a way to multiply vectors together.

The dot product of two vectors is the sum of the products of the corresponding elements of the two vectors.

\[ \vec{v} = \begin{bmatrix} 2 \\ 5 \end{bmatrix} \]

\[ \vec{w} = \begin{bmatrix} 3 \\ -5 \end{bmatrix} \]

\[ \vec{v} \cdot \vec{w} = 2 \times 3 + 5 \times (-5) = -19 \]

Formally, the dot product is defined as:

\[ \vec{v} \cdot \vec{w} = \sum_{i=1}^{n} v_i w_i \]

The output is a scalar.
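We can double-check that result with numpy:

import numpy as np

v = np.array([2, 5])
w = np.array([3, -5])
np.dot(v, w)   # 2*3 + 5*(-5) = -19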

So what is the dot product used for?

One of its uses is to calculate the angle between two vectors.

Finding the angle between two vectors

We can use the dot product to find the angle between two vectors.

## Draw two vectors and the angle between them

import matplotlib.pyplot as plt
import numpy as np
from matplotlib.patches import Arc

plt.quiver(0, 0, 2, 5, angles='xy', scale_units='xy', scale=1, color='blue')
plt.text(2, 5, r'$\vec{v}$', size=18, color='blue')

plt.quiver(0, 0, 3, -5, angles='xy', scale_units='xy', scale=1, color='orange')
plt.text(3, -6, r'$\vec{w}$', size=18, color='orange')

# draw an arc for the angle theta between the two vectors
theta_v = np.degrees(np.arctan2(5, 2))    # direction of v
theta_w = np.degrees(np.arctan2(-5, 3))   # direction of w
plt.gca().add_patch(Arc((0, 0), 3, 3, theta1=theta_w, theta2=theta_v, color='green'))
plt.text(0.3, 0, r'$\theta$', size=18, color='green')

plt.xlim(-10, 10)
plt.ylim(-10, 10)

plt.show()

How do we calculate \(\theta\)?

Remember the dot product equation?

It can in fact be used to calculate the angle between two vectors.

\[ \vec{v} \cdot \vec{w} = ||\vec{v}|| \times ||\vec{w}|| \times \cos(\theta) \]

We omit the derivation here, but you can find it on Wikipedia.

From the above equation, we can calculate the angle between two vectors.

\[ \theta = \cos^{-1} \left( \frac{\vec{v} \cdot \vec{w}}{||\vec{v}|| \times ||\vec{w}||} \right) \]

Usually we don’t need to calculate the angle directly, but we can use the dot product to calculate the cosine similarity between two vectors.

\[ \cos{\theta} = \frac{\vec{v} \cdot \vec{w}}{||\vec{v}|| \times ||\vec{w}||} \]

  • When two vectors point in similar directions, the angle between them is small and the cosine similarity is close to 1.
  • When two vectors are perpendicular (unrelated), the angle between them is 90 degrees and the cosine similarity is 0.
  • When two vectors point in opposite directions, the angle between them is close to 180 degrees and the cosine similarity is close to -1.
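To make this concrete, here is a small cosine similarity helper in numpy, applied to the example vectors \(\vec{v}\) and \(\vec{w}\) from above:

import numpy as np

def cosine_similarity(v: np.ndarray, w: np.ndarray) -> float:
    # cos(theta) = (v . w) / (||v|| * ||w||)
    return np.dot(v, w) / (np.linalg.norm(v) * np.linalg.norm(w))

v = np.array([2, 5])
w = np.array([3, -5])
cos_theta = cosine_similarity(v, w)   # ≈ -0.605
np.degrees(np.arccos(cos_theta))      # ≈ 127 degrees

The negative value matches the picture above: the two arrows point in broadly opposite directions.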

Why do we care about cosine similarity? We will see later in Word Embedding :)

3D Vector

So far we’ve only seen 2D vectors. But what about 3D vectors?

import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D
import numpy as np

fig = plt.figure(figsize=(10, 10))
ax = fig.add_subplot(111, projection='3d')

ax.quiver(0, 0, 0, 1, 2, 5, arrow_length_ratio=0.1, colors='black')
ax.text(1, 2, 3, r'$\vec{v}$', size=18)

ax.set_xlim(-10, 10)
ax.set_ylim(-10, 10)
ax.set_zlim(-10, 10)

plt.show()

It’s hard to visualize 3D vectors, but the concept is the same:

We have the magnitude and the direction.

\[ \vec{v} = \begin{bmatrix} 1 \\ 2 \\ 5 \end{bmatrix} \]

The magnitude is \[ ||\vec{v}|| = \sqrt{1^2 + 2^2 + 5^2} = \sqrt{1 + 4 + 25} = \sqrt{30} \]

Multidimensional Vector

We can also have vectors with more than 3 dimensions.

\[ \vec{v} = \begin{bmatrix} 1 \\ 2 \\ 5 \\ 3 \\ 4 \\ 5 \\ 6 \\ 7 \\ 8 \\ 9 \\ 10 \end{bmatrix} \]

This vector has 11 dimensions. But - since we live in a 3D world - it’s hard to visualize vectors with more than 3 dimensions ヽ(・_・;)ノ
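Even though we can’t draw it, the magnitude formula from before works the same way in any number of dimensions:

import numpy as np

v = np.array([1, 2, 5, 3, 4, 5, 6, 7, 8, 9, 10])
np.linalg.norm(v)   # sqrt(1 + 4 + 25 + 9 + 16 + 25 + 36 + 49 + 64 + 81 + 100) = sqrt(410) ≈ 20.25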

What is Word Embedding?

Let’s get back to word embedding.

Word embedding is a way to represent words as vectors

So, word \(\text{"cat"}\) could be represented as vector \(\vec{v}\)

\[ \vec{v} = \begin{bmatrix} 1.5 \\ 2.2 \\ 5.5 \\ ... \\ 2.5 \end{bmatrix} \]

and word \(\text{"dog"}\) could be represented as vector \(\vec{w}\)

\[ \vec{w} = \begin{bmatrix} 2.5 \\ 3.2 \\ 4.5 \\ ... \\ 1.5 \end{bmatrix} \]

Yes, it’s just a vector and it’s multidimensional.

# URL of the GloVe embeddings archive and the file we want from it
url = 'http://nlp.stanford.edu/data/glove.6B.zip'
zip_path = 'glove.6B.zip'
path = 'glove.6B.300d.txt'

# download and extract the archive if the txt file does not exist yet
import os
import urllib.request
import zipfile

if not os.path.exists(path):
    if not os.path.exists(zip_path):
        urllib.request.urlretrieve(url, zip_path)
    with zipfile.ZipFile(zip_path) as zf:
        zf.extract(path)

# load into gensim as word vectors (GloVe text files have no header line)
import gensim
wv = gensim.models.KeyedVectors.load_word2vec_format(path, binary=False, no_header=True)
wv
<gensim.models.keyedvectors.KeyedVectors at 0x122ba0550>
wv['cat']
array([-3.3712e-01, -2.1691e-01, -6.6365e-03, -4.1625e-01, -1.2555e+00,
       -2.8466e-02, -7.2195e-01, -5.2887e-01,  7.2085e-03,  3.1997e-01,
        2.9425e-02, -1.3236e-02,  4.3511e-01,  2.5716e-01,  3.8995e-01,
       -1.1968e-01,  1.5035e-01,  4.4762e-01,  2.8407e-01,  4.9339e-01,
        6.2826e-01,  2.2888e-01, -4.0385e-01,  2.7364e-02,  7.3679e-03,
        1.3995e-01,  2.3346e-01,  6.8122e-02,  4.8422e-01, -1.9578e-02,
       -5.4751e-01, -5.4983e-01, -3.4091e-02,  8.0017e-03, -4.3065e-01,
       -1.8969e-02, -8.5670e-02, -8.1123e-01, -2.1080e-01,  3.7784e-01,
       -3.5046e-01,  1.3684e-01, -5.5661e-01,  1.6835e-01, -2.2952e-01,
       -1.6184e-01,  6.7345e-01, -4.6597e-01, -3.1834e-02, -2.6037e-01,
       -1.7797e-01,  1.9436e-02,  1.0727e-01,  6.6534e-01, -3.4836e-01,
        4.7833e-02,  1.6440e-01,  1.4088e-01,  1.9204e-01, -3.5009e-01,
        2.6236e-01,  1.7626e-01, -3.1367e-01,  1.1709e-01,  2.0378e-01,
        6.1775e-01,  4.9075e-01, -7.5210e-02, -1.1815e-01,  1.8685e-01,
        4.0679e-01,  2.8319e-01, -1.6290e-01,  3.8388e-02,  4.3794e-01,
        8.8224e-02,  5.9046e-01, -5.3515e-02,  3.8819e-02,  1.8202e-01,
       -2.7599e-01,  3.9474e-01, -2.0499e-01,  1.7411e-01,  1.0315e-01,
        2.5117e-01, -3.6542e-01,  3.6528e-01,  2.2448e-01, -9.7551e-01,
        9.4505e-02, -1.7859e-01, -3.0688e-01, -5.8633e-01, -1.8526e-01,
        3.9565e-02, -4.2309e-01, -1.5715e-01,  2.0401e-01,  1.6906e-01,
        3.4465e-01, -4.2262e-01,  1.9553e-01,  5.9454e-01, -3.0531e-01,
       -1.0633e-01, -1.9055e-01, -5.8544e-01,  2.1357e-01,  3.8414e-01,
        9.1499e-02,  3.8353e-01,  2.9075e-01,  2.4519e-02,  2.8440e-01,
        6.3715e-02, -1.5483e-01,  4.0031e-01,  3.1543e-01, -3.7128e-02,
        6.3363e-02, -2.7090e-01,  2.5160e-01,  4.7105e-01,  4.9556e-01,
       -3.6401e-01,  1.0370e-01,  4.6076e-02,  1.6565e-01, -2.9024e-01,
       -6.6949e-02, -3.0881e-01,  4.8263e-01,  3.0972e-01, -1.1145e-01,
       -1.0329e-01,  2.8585e-02, -1.3579e-01,  5.2924e-01, -1.4077e-01,
        9.1763e-02,  1.3127e-01, -2.0944e-01,  2.2327e-02, -7.7692e-02,
        7.7934e-02, -3.3067e-02,  1.1680e-01,  3.2029e-01,  3.7749e-01,
       -7.5679e-01, -1.5944e-01,  1.4964e-01,  4.2253e-01,  2.8136e-03,
        2.1328e-01,  8.6776e-02, -5.2704e-02, -4.0859e-01, -1.1774e-01,
        9.0621e-02, -2.3794e-01, -1.8326e-01,  1.3115e-01, -5.5949e-01,
        9.2071e-02, -3.9504e-02,  1.3334e-01,  4.9632e-01,  2.8733e-01,
       -1.8544e-01,  2.4618e-02, -4.2826e-01,  7.4148e-02,  7.6584e-04,
        2.3950e-01,  2.2615e-01,  5.5166e-02, -7.5096e-02, -2.2308e-01,
        2.3775e-01, -4.5455e-01,  2.6564e-01, -1.5137e-01, -2.4146e-01,
       -2.4736e-01,  5.5214e-01,  2.6819e-01,  4.8831e-01, -1.3423e-01,
       -1.5918e-01,  3.7606e-01, -1.9834e-01,  1.6699e-01, -1.5368e-01,
        2.4561e-01, -9.2506e-02, -3.0257e-01, -2.9493e-01, -7.4917e-01,
        1.0567e+00,  3.7971e-01,  6.9314e-01, -3.1672e-02,  2.1588e-01,
       -4.0739e-01, -1.5264e-01,  3.2296e-01, -1.2999e-01, -5.0129e-01,
       -4.4231e-01,  1.6904e-02, -1.1459e-02,  7.2293e-03,  1.1026e-01,
        2.1568e-01, -3.2373e-01, -3.7292e-01, -9.2456e-03, -2.6769e-01,
        3.9066e-01,  3.5742e-01, -6.0632e-02,  6.7966e-02,  3.3830e-01,
        6.5747e-02,  1.5794e-01,  4.7155e-02,  2.3682e-01, -9.1370e-02,
        6.4649e-01, -2.5491e-01, -6.7940e-01, -6.9752e-01, -1.0145e-01,
       -3.6255e-01,  3.6967e-01, -4.1295e-01,  8.2724e-02, -3.5053e-01,
       -1.7564e-01,  8.5095e-02, -5.7724e-01,  5.0252e-01,  5.2180e-01,
        5.7327e-02, -7.9754e-01, -3.7770e-01,  7.8149e-01,  2.4597e-01,
        6.0672e-01, -2.0082e-01, -3.8792e-01,  4.1295e-01, -1.6143e-01,
        1.0427e-02,  4.3197e-01,  4.6297e-03,  2.1185e-01, -2.6606e-01,
       -5.8740e-02, -5.1003e-01,  2.8524e-01,  1.3627e-02, -2.7346e-01,
        6.1848e-02, -5.7901e-01, -5.1136e-01,  3.6382e-01,  3.5144e-01,
       -1.6501e-01, -4.6041e-01, -6.4742e-02, -6.8310e-01, -4.7427e-02,
        1.5861e-01, -4.7288e-01,  3.3968e-01,  1.2092e-03,  1.6018e-01,
       -5.8024e-01,  1.4556e-01, -9.1317e-01, -3.7592e-01, -3.2950e-01,
        5.3465e-01,  1.8224e-01, -5.2265e-01, -2.6209e-01, -4.2458e-01,
       -1.8034e-01,  9.9502e-02, -1.5114e-01, -6.6731e-01,  2.4483e-01,
       -5.6630e-01,  3.3843e-01,  4.0558e-01,  1.8073e-01,  6.4250e-01],
      dtype=float32)

That’s the vector of the word “cat”. How many dimensions does it have?

wv['cat'].shape
(300,)

How about tiger?

wv['tiger']
array([ 3.1805e-01,  3.8612e-01,  1.0725e-01,  2.8261e-01, -4.4965e-02,
        1.0612e-02,  4.3426e-01,  1.1006e+00,  1.5124e-01, -7.5199e-01,
        5.4254e-01, -2.5544e-01, -1.6400e-01,  1.6128e-01, -1.7060e-02,
       -2.2410e-01,  1.2682e-01,  8.4087e-01, -2.7631e-01,  4.4310e-02,
        2.6123e-01, -3.8948e-02, -1.4925e-01, -6.0481e-01, -1.1059e+00,
       -1.1135e-01, -5.9403e-02, -2.2909e-01,  6.7889e-01,  1.8288e-01,
        6.9610e-02, -1.3831e+00,  5.7360e-02, -3.3441e-01, -2.6577e-01,
       -3.4069e-01,  1.7086e-01,  5.9148e-01, -8.3631e-01,  4.8743e-01,
        2.4388e-01, -4.2785e-01,  3.9639e-01, -1.8224e-01, -3.1574e-01,
       -4.1929e-01,  4.3294e-01, -3.1500e-01, -2.3390e-01, -9.5833e-03,
        9.6671e-01, -1.8473e-01,  1.5179e-01,  3.5956e-01, -5.4430e-02,
        2.4032e-01, -1.7691e-02,  1.0346e+00, -2.3621e-01, -4.6284e-02,
       -6.3183e-01, -2.6131e-01,  2.2495e-01,  6.5933e-01,  9.7632e-02,
       -1.4428e-01, -5.1098e-01, -6.4340e-01,  2.2279e-01,  4.7017e-01,
        6.8450e-02,  3.4013e-01,  5.0337e-02,  2.4793e-02, -1.3726e-02,
        2.2475e-01,  5.3832e-01, -6.0123e-01,  2.4434e-01,  3.5062e-01,
        1.4058e-01, -2.4493e-01,  1.0419e-01, -4.9484e-01,  3.2629e-01,
       -6.3158e-01, -8.1345e-01, -3.5166e-02,  2.3366e-01, -2.6914e-01,
       -1.5012e-02, -9.1551e-02, -3.5173e-01,  3.2239e-02,  4.4592e-01,
        3.2620e-01,  2.0522e-01, -4.6460e-01, -5.6885e-01, -4.6610e-01,
        9.7700e-02,  3.7100e-01,  5.5914e-01,  3.3662e-01,  5.5955e-01,
       -2.6679e-02,  2.0628e-02,  1.6219e-01, -7.9194e-02,  2.1179e-01,
       -8.1665e-02,  9.6164e-02, -8.2605e-01, -1.6286e-01,  9.8834e-02,
        8.9302e-02,  1.8239e-01,  5.2664e-01, -6.4723e-01, -2.7549e-01,
       -1.0490e+00, -5.9390e-01, -2.0139e-01,  5.8160e-01, -2.9544e-01,
       -2.3005e-01,  1.9733e-01,  3.1993e-01,  4.0029e-02, -4.2565e-01,
       -2.6076e-01,  2.8575e-01, -1.0009e-01, -2.0921e-02, -1.6854e-01,
        2.3219e-01,  1.2139e-02, -1.6396e-01, -2.2856e-01,  3.1307e-01,
        4.4448e-02,  2.7773e-02,  4.2594e-01, -4.2870e-01, -3.9471e-01,
       -4.0541e-01,  6.3980e-02,  5.5482e-01,  4.8681e-02, -2.6031e-01,
       -2.5607e-01, -6.8518e-02, -2.1721e-01, -2.1251e-01,  4.8762e-01,
       -4.8040e-01, -3.9291e-01, -2.6542e-01,  3.0685e-01,  1.3879e+00,
        3.1551e-01,  3.7906e-01,  4.8222e-02, -2.6885e-01, -6.7447e-01,
       -5.8845e-02,  1.4894e-02,  3.1901e-01, -8.5330e-01, -5.3998e-01,
       -6.5799e-01, -2.2534e-01, -9.9321e-02,  7.0429e-01,  1.1255e-01,
       -2.8921e-01,  1.4851e-01,  2.5052e-01,  5.3332e-01, -1.3101e-02,
        6.0388e-02, -1.5534e-01, -1.6244e-01,  1.8062e-01, -2.5835e-03,
        1.0043e-01, -1.4026e-01,  2.9892e-01,  2.6348e-01, -7.3950e-02,
        3.7324e-01,  1.5981e-01,  2.0407e-01,  1.0287e-01,  6.6057e-02,
        9.6447e-02,  4.9990e-01, -3.2505e-02,  1.1403e-01, -3.0171e-01,
        1.8904e+00, -4.2511e-01, -1.4158e-01, -5.4888e-01, -2.0008e-01,
        3.7909e-01, -6.6070e-01, -2.0747e-01, -3.6918e-01,  6.8243e-02,
        5.6493e-02,  6.7445e-02, -2.1361e-01, -9.9830e-01,  3.2986e-01,
       -5.5691e-01,  1.7576e-01,  3.5422e-01, -6.5196e-02, -1.6417e-02,
        8.3042e-01,  2.6428e-01,  2.9994e-01, -5.1640e-01,  1.2353e-01,
       -4.6543e-01,  3.8272e-01, -3.6424e-01, -3.6278e-01, -5.6585e-01,
        2.3366e-01, -7.2896e-01,  3.5874e-01, -6.2963e-03, -8.0878e-02,
       -1.9360e-01,  2.1159e-01, -9.0342e-02, -7.9771e-01,  3.0855e-01,
       -4.8318e-01,  1.3295e-01,  4.0856e-02,  9.5406e-01, -7.3737e-01,
        4.6077e-01, -8.2662e-02,  2.2545e-01,  2.5722e-01, -5.7956e-01,
       -1.1102e+00, -2.4182e-01, -5.3534e-01, -3.9995e-01,  8.9786e-01,
       -4.7030e-01,  6.8895e-01, -6.4400e-02, -3.0525e-01, -2.4539e-01,
        1.0649e-01,  1.1519e-04,  5.2123e-02, -2.3651e-01, -2.7918e-01,
        1.2954e-01,  1.3222e-01,  4.5636e-01, -1.6590e-01, -1.6413e-01,
        2.1242e-01, -9.8367e-02,  2.7643e-01, -3.5059e-01,  4.2767e-01,
       -3.6123e-01, -5.3538e-01,  1.1485e-01, -2.5226e-01, -1.7993e-01,
       -1.4363e-01,  1.2369e-01,  2.3747e-01,  2.0453e-01, -6.1958e-01,
       -2.3841e-01, -5.7274e-01, -2.2474e-01,  2.1683e-01, -2.8839e-01,
       -3.5495e-01, -4.4978e-01, -7.3920e-01, -1.0828e-01,  6.0953e-03,
        9.4683e-01,  3.6775e-01,  1.4240e-01,  2.5970e-01,  2.5982e-01],
      dtype=float32)

Remember the cosine similarity?

What if we compute the cosine similarity between the vectors of “cat” and “tiger”?

# cosine similarity of cat and tiger
import numpy as np
np.dot(wv['cat'], wv['tiger']) / (np.linalg.norm(wv['cat']) * np.linalg.norm(wv['tiger']))
0.31289068

It represents the similarity between “cat” and “tiger”. But how do we interpret the result?

Let’s calculate the cosine similarity between “cat” and “cat”

np.dot(wv['cat'], wv['cat']) / (np.linalg.norm(wv['cat']) * np.linalg.norm(wv['cat']))
1.0

It’s 1. Remember that the cosine similarity is 1 when two vectors point in exactly the same direction, and a vector always points in the same direction as itself.

a = "cat"
b = ["cat", "dog", "woman", "cute", "scary", "computer", "economy", "america", "japan", "pet", "bad"]

# create a pandas dataframe: first column is the word, second column is the cosine similarity
import pandas as pd

rows = []
for word in b:
    sim = np.dot(wv[a], wv[word]) / (np.linalg.norm(wv[a]) * np.linalg.norm(wv[word]))
    rows.append({'word': word, 'cosine_similarity': sim})

df = pd.DataFrame(rows)

# sort by cosine similarity
df.sort_values(by='cosine_similarity', ascending=False)
word cosine_similarity
0 cat 1.000000
1 dog 0.681675
9 pet 0.587037
3 cute 0.339551
2 woman 0.288396
10 bad 0.247380
5 computer 0.204328
4 scary 0.177671
7 america 0.123796
8 japan 0.079736
6 economy -0.030009

Interesting, right? Now we know why cosine similarity is useful.

Oh, and instead of calculating the cosine similarity manually, we can use the similarity function from the gensim library :)

wv.similarity("king", "ruler")
0.48487008
wv.similarity("king", "queen")
0.63364697

We can also list the most similar words to a given word using the most_similar function.

wv.most_similar("king")
[('queen', 0.6336469054222107),
 ('prince', 0.6196622252464294),
 ('monarch', 0.5899620652198792),
 ('kingdom', 0.5791266560554504),
 ('throne', 0.5606487393379211),
 ('ii', 0.5562329888343811),
 ('iii', 0.5503198504447937),
 ('crown', 0.5224862098693848),
 ('reign', 0.5217353701591492),
 ('kings', 0.5066401362419128)]

Interesting properties of word embedding

Word embedding has some interesting properties.

Let’s see some examples.

\[King - Queen = Man - ...\]

read as: the difference between King and Queen is similar to the difference between Man and what?

To answer that, just do simple arithmetic.

\[ King - Queen = Man - x\\ x = Man + Queen - King \]

Which can be implemented in Python as:

wv.most_similar(positive=['man', 'queen'], negative=['king'])
[('woman', 0.6957679986953735),
 ('girl', 0.5603842735290527),
 ('person', 0.5134302973747253),
 ('she', 0.4802548587322235),
 ('mother', 0.4633125066757202),
 ('boy', 0.46078377962112427),
 ('lady', 0.45522934198379517),
 ('teenager', 0.45107489824295044),
 ('her', 0.4438043534755707),
 ('men', 0.4426511526107788)]

\[ Strong - Stronger = Weak - ... \]

wv.most_similar(positive=['weak', 'stronger'], negative=['strong'])
[('weaker', 0.790091872215271),
 ('weakened', 0.5707376003265381),
 ('weakening', 0.5706116557121277),
 ('sluggish', 0.5251186490058899),
 ('weaken', 0.5133482217788696),
 ('worse', 0.505226194858551),
 ('weakness', 0.501374363899231),
 ('slower', 0.4901147186756134),
 ('significantly', 0.48916390538215637),
 ('cheaper', 0.4886264503002167)]
wv.most_similar(positive=['fast', 'stronger'], negative=['strong'])
[('faster', 0.7575436234474182),
 ('quicker', 0.6800161004066467),
 ('slower', 0.6144760847091675),
 ('harder', 0.5210117101669312),
 ('easier', 0.5203530192375183),
 ('pace', 0.5148283839225769),
 ('slow', 0.5114564895629883),
 ('cheaper', 0.4829019606113434),
 ('better', 0.4786846339702606),
 ('rapidly', 0.44480517506599426)]

\[ Indonesia - Jakarta = Japan - ... \]

wv.most_similar(positive=['japan', 'jakarta'], negative=['indonesia'])
[('tokyo', 0.8145835995674133),
 ('japanese', 0.655620276927948),
 ('seoul', 0.6100760102272034),
 ('osaka', 0.569068193435669),
 ('kyodo', 0.5087528824806213),
 ('hashimoto', 0.457542359828949),
 ('shimbun', 0.4574349522590637),
 ('manila', 0.45497873425483704),
 ('koizumi', 0.45251786708831787),
 ('nikkei', 0.4397467374801636)]

\[ Japan - Kimono = Indonesia - ... \]

wv.most_similar(positive=['indonesia', 'kimono'], negative=['japan'])
[('batik', 0.55389803647995),
 ('bathrobe', 0.44231539964675903),
 ('robe', 0.42582112550735474),
 ('sashes', 0.4215601980686188),
 ('sarongs', 0.41601523756980896),
 ('balinese', 0.41500627994537354),
 ('frock', 0.41402822732925415),
 ('tunic', 0.41291648149490356),
 ('saree', 0.4119526147842407),
 ('drapes', 0.41082924604415894)]

\[ Eat - Ate = Drink - ... \]

wv.most_similar(positive=['drink', 'ate'], negative=['eat'])
[('drank', 0.8188783526420593),
 ('drinks', 0.7133989334106445),
 ('sipped', 0.5968326330184937),
 ('beer', 0.5672910213470459),
 ('drinking', 0.5617578625679016),
 ('beverage', 0.5377948880195618),
 ('vodka', 0.5361367464065552),
 ('beverages', 0.5348184704780579),
 ('soda', 0.5245417356491089),
 ('sipping', 0.5153992176055908)]

Cool, isn’t it?

But how do we create these embeddings in the first place? Let’s find out in the next lesson.
