!pip install gensim
!pip install transformers
Second architecture: Using Word Embedding for sentiment classification
The goal of learning with this second architecture is to understand how word embeddings can create semantic relations between words. We’ll use an RNN architecture later to utilize word embeddings to the fullest. From the intuition to the math, understanding the word embedding concept and the basic RNN architecture will hopefully satisfy your curiosity about how a model can really understand a sentence.
Word Embedding: Every word has its own data
We have already built the intuition above that a word embedding is like a “scorecard” for every single word. Utilizing word embeddings is really about understanding that every single word can carry its own information, whether it’s about gender, about grammatical rules, about meaning, etc.
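To make the “scorecard” intuition concrete, here is a toy sketch. The dimension names and values are completely made up for illustration; real embedding dimensions are learned and have no human-readable labels:
import numpy as np

# Toy "scorecards": one score per made-up dimension for each word.
# Real embeddings are learned, and their dimensions aren't labeled like this.
dimensions = ['royalty', 'masculinity', 'plurality']
toy_embeddings = {
    'king':  np.array([0.95,  0.90, -0.70]),
    'queen': np.array([0.93, -0.85, -0.70]),
    'kings': np.array([0.94,  0.88,  0.80]),
}

# Related words end up with similar scorecards
for word, vector in toy_embeddings.items():
    print(word, dict(zip(dimensions, vector)))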
#@title Download and load the word embedding
#@markdown We separated the download and the loading of the word embedding so you can run the visualizations, similarity calculations, etc. below faster, without having to keep re-downloading
import os
import numpy as np
import requests, zipfile, io
def download_and_unzip_embeddings(url, directory):
    print(f'Downloading and unzipping embeddings...')
    r = requests.get(url)
    z = zipfile.ZipFile(io.BytesIO(r.content))
    z.extractall(path=directory)

def load_glove_embeddings(path, url):
    # If the file doesn't exist yet, download and unzip it
    if not os.path.isfile(path):
        download_and_unzip_embeddings(url, path.rsplit('/', 1)[0])
    embeddings = {}
    with open(path, 'r') as f:
        for line in f:
            values = line.split()
            word = values[0]
            vector = np.asarray(values[1:], dtype='float32')
            embeddings[word] = vector
    return embeddings

# URL of the GloVe embeddings archive and the path of the extracted file
url = 'http://nlp.stanford.edu/data/glove.6B.zip'
path = 'glove.6B/glove.6B.300d.txt'

embeddings = load_glove_embeddings(path, url)
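Each line of the GloVe text file is simply a word followed by the numbers of its vector, so once loaded we can inspect individual words directly. A quick sanity check (the words and the sample numbers in the comment are just illustrative):
# A raw line of glove.6B.300d.txt looks like: "the 0.046 0.213 -0.007 ..."
# i.e. a word followed by its 300 vector values.
print(embeddings['king'].shape)  # (300,) - one 300-dimensional vector per word

# Cosine similarity between two word vectors, computed by hand
def cosine_similarity(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

print(cosine_similarity(embeddings['king'], embeddings['queen']))   # expected: high
print(cosine_similarity(embeddings['king'], embeddings['banana']))  # expected: much lower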
from gensim.models import Word2Vec, KeyedVectors
from sklearn.manifold import TSNE
import numpy as np
import pandas as pd
import plotly.graph_objects as go
def load_glove_model(glove_input_file):
    glove_model = KeyedVectors.load_word2vec_format(glove_input_file, binary=False)
    return glove_model

# Convert the GloVe file format to the word2vec file format.
# Note: glove2word2vec is deprecated in newer gensim releases;
# KeyedVectors.load_word2vec_format(..., binary=False, no_header=True)
# can load GloVe text vectors directly instead.
glove_input_file = 'glove.6B/glove.6B.50d.txt'
word2vec_output_file = 'glove.6B/glove.6B.50d.txt.word2vec'
from gensim.scripts.glove2word2vec import glove2word2vec
glove2word2vec(glove_input_file, word2vec_output_file)
model = load_glove_model(word2vec_output_file)
Scatter chart for word relations in the GloVe word embedding
One of the most popular word embedding dictionaries is GloVe: Global Vectors for Word Representation. GloVe is basically an existing dictionary of English word embeddings that collects hundreds of thousands to millions of vocabulary words and maps every single one of them to its own word embedding vector.
NLP and words: The problem of finding relations
We’ll talk more about how a word gets attached to a context later, when we learn how a word embedding is generated, but in a broad sense a word embedding is generated by learning word relations.
For now, look at the scatter plot below to see a single word and the words GloVe learned are related to it.
#@title Scatterplot of word relations
word = "king" #@param
find_nearest = 50

# Use gensim's most_similar to get the top-N words closest to the input word
result = model.most_similar(word, topn=find_nearest)
word_labels = [word for word, similarity in result]
similarity_scores = [similarity for word, similarity in result]

# Add the input word itself, then fetch all vectors from the model
word_labels.append(word)
word_vectors = model[word_labels]
# Reduce the vectors to 2 dimensions with t-SNE for plotting
tsne = TSNE(n_components=2)
reduced_vectors = tsne.fit_transform(word_vectors)

df = pd.DataFrame(reduced_vectors, columns=["tsne1", "tsne2"])
df['word'] = word_labels
df['is_input_word'] = (df['word'] == word)

df_input_word = df[df['is_input_word']]
df_similar_words = df[~df['is_input_word']]

fig = go.Figure()

fig.add_trace(go.Scatter(
    x=df_similar_words["tsne1"],
    y=df_similar_words["tsne2"],
    mode='markers+text',
    marker=dict(
        size=8,
        color='rgba(152, 0, 0, .8)',
    ),
    text=df_similar_words['word'],
    textposition='top center',
    name='Similar words'
))
fig.add_trace(go.Scatter(
    x=df_input_word["tsne1"],
    y=df_input_word["tsne2"],
    mode='markers+text',
    marker=dict(
        size=12,
        color='rgba(0, 152, 0, .8)',
    ),
    text=df_input_word['word'],
    textposition='top center',
    name='Input word'
))
fig.update_layout(
    title_text=f'2D visualization of word embeddings for "{word}" and its similar words',
    xaxis=dict(title='t-SNE 1'),
    yaxis=dict(title='t-SNE 2'))

fig.show()

# Similarity bar chart, only for the similar words
fig2 = go.Figure(data=[
    go.Bar(x=word_labels[:find_nearest], y=similarity_scores,
           text=similarity_scores, textposition='auto')
])
fig2.update_layout(
    title_text=f'Bar chart showing similarity scores of top {find_nearest} words similar to "{word}"',
    xaxis=dict(title='Words'),
    yaxis=dict(title='Similarity Score'))

fig2.show()
As you can see with the word “king” above, the GloVe word embedding creates relations from that word to other related words:
- “queen” is the term for a king’s wife
- “empire” and “kingdom” are terms for a king’s territory of power
- “death” is a term reflecting that a king can be deceased
And so on. Of course, it’s hard to pinpoint exactly what context GloVe understands for a single word, or why exactly GloVe “thinks” one word is related to another, because GloVe word embeddings are created by a machine learning model. This means word embeddings mostly lack explainability in how their relations are created; we can only guess.
Similarity in context
Before we continue, one term you should know: when we check how a word relates to another word, this is mostly referred to as “similarity”. Similarity means how similar the given word is to another word’s context; in NLP, similarity mostly means:
- How often two words appear near each other in a sentence
- How two words, even when they don’t often appear together, are mostly paired with similar words, e.g.: “car” often comes with the word “drive”, and “bus” also comes with the word “drive” (see the quick check below)
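We can check that second point directly with the GloVe model we loaded above (the word picks are just illustrative):
# "car" and "bus" rarely sit next to each other in a sentence, yet both
# co-occur with words like "drive", so their vectors should end up close
print(model.similarity('car', 'bus'))      # expected: relatively high
print(model.similarity('car', 'banana'))   # expected: much lower
print(model.similarity('car', 'drive'))
print(model.similarity('bus', 'drive'))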
Of course, we’ll dive further into this concept later in its dedicated section. Remember that this word embedding was generated from Wikipedia data: more text to learn from gives better context, and a limited source can weaken the word embedding’s accuracy in giving context to each word.
Multiple words relation
We can also visualize the word relations for several words at once, so we can see how the GloVe word embedding thinks some words relate to others.
You can try the demonstration below; feel free to play with the input. Notice that you can add negative words as well, to make sure certain words won’t be included in your plot.
#@title Multiple words similarity
import plotly.graph_objects as go

words_str = "dog,cat" # @param
neg_words_str = "man" # @param
find_nearest = 50

# Parse arrays from the comma-delimited strings
words = words_str.split(',')
neg_words = neg_words_str.split(',')

# Filter out empty strings
words = [word for word in words if word]
neg_words = [word for word in neg_words if word]

# Use most_similar to get the top 'find_nearest' words most similar to the list of words
result_positive = model.most_similar(positive=words, negative=neg_words, topn=find_nearest)

word_labels = [word for word, similarity in result_positive]
similarity_scores = [similarity for word, similarity in result_positive]

# Extend labels with the input words themselves
word_labels.extend(words)

# Extract vectors for all words
word_vectors = model[word_labels]

# Reduce dimensionality for visualization
tsne = TSNE(n_components=2)
reduced_vectors = tsne.fit_transform(word_vectors)

# Prepare DataFrame
df = pd.DataFrame(reduced_vectors, columns=["tsne1", "tsne2"])
df['word'] = word_labels
df['is_input_word'] = df['word'].isin(words)

df_input_word = df[df['is_input_word']]
df_similar_words = df[~df['is_input_word']]

# Word embedding scatter plot
fig = go.Figure()

# Similar words
fig.add_trace(go.Scatter(
    x=df_similar_words["tsne1"],
    y=df_similar_words["tsne2"],
    mode='markers+text',
    marker=dict(
        size=8,
        color='rgba(152, 0, 0, .8)',
    ),
    text=df_similar_words['word'],
    textposition='top center',
    name='Similar words'
))
# Input words
fig.add_trace(go.Scatter(
    x=df_input_word["tsne1"],
    y=df_input_word["tsne2"],
    mode='markers+text',
    marker=dict(
        size=12,
        color='rgba(0, 152, 0, .8)',
    ),
    text=df_input_word['word'],
    textposition='top center',
    name='Input words'
))
fig.update_layout(
    title_text=f'2D visualization of word embeddings for {words} and their most similar words',
    xaxis=dict(title='t-SNE 1'),
    yaxis=dict(title='t-SNE 2'))

fig.show()

# Similarity bar chart, only for the similar words
fig2 = go.Figure(data=[
    go.Bar(x=word_labels[:find_nearest], y=similarity_scores,
           text=similarity_scores, textposition='auto')
])
fig2.update_layout(
    title_text=f'Bar chart showing similarity scores of top {find_nearest} words similar to {words}',
    xaxis=dict(title='Words'),
    yaxis=dict(title='Similarity Score'))

fig2.show()
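The positive/negative mechanism above is the same one behind the famous word-analogy trick: start from “king”, subtract the “man” direction, add the “woman” direction, and the nearest word should be “queen”. A quick check with our loaded model:
# king - man + woman ≈ queen, using the same positive/negative arguments
print(model.most_similar(positive=['king', 'woman'], negative=['man'], topn=3))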
Dimensionality reduction
Remember that in reality a word embedding’s dimensionality is a lot larger than 2 (our GloVe model above has 50 dimensions per word). What we did above is called dimensionality reduction, using t-SNE. The intuition behind dimensionality reduction can be simplified with the following analogy:
We live in a 3-dimensional world, and we often “reduce the dimension” of what we see in nature by taking a photo (the photo is 2-dimensional).
This analogy can help you understand that:
- We can capture a higher-dimensional vector in a lower dimension
- We will lose lots of information along the way. 2 dimensions can’t provide context from our real world such as depth; we can’t see what is behind objects in our photo, etc.
Reducing dimensions == removing context
Remember the earlier intuition that when we embed a word, each dimension encodes a different kind of context? If we reduce dimensions, as in the previous scatter plots, we’ll find that lots of that context goes missing.
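One way to see this loss concretely is to compare a word’s nearest neighbors in the full 50-dimensional space against its neighbors in the 2-dimensional t-SNE projection. This minimal sketch (reusing model, TSNE, and np from the cells above) counts how often the two neighbor sets agree:
# Compare neighbors before and after dimensionality reduction
words_to_check = [w for w, _ in model.most_similar('king', topn=50)] + ['king']
vectors_50d = model[words_to_check]
vectors_2d = TSNE(n_components=2, random_state=0).fit_transform(vectors_50d)

def nearest_in(vectors, index, k=5):
    # Brute-force k nearest neighbors by Euclidean distance
    distances = np.linalg.norm(vectors - vectors[index], axis=1)
    order = np.argsort(distances)[1:k + 1]  # skip the word itself
    return {words_to_check[i] for i in order}

idx = words_to_check.index('king')
shared = nearest_in(vectors_50d, idx) & nearest_in(vectors_2d, idx)
print(f'the 50d and 2d spaces agree on {len(shared)} of the 5 nearest neighbors')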
Curse of dimensionality
The problem of finding similar data is not as simple as finding a key you lost in your room. Enter the curse of dimensionality:
Imagine you’re playing hide-and-seek with your friend in a long, straight hallway with doors on either side, each door leading to a room. Although it might take some time, you have only one direction to go: you can walk one way, then back the other way, to check each room systematically. This is the equivalent of a one-dimensional problem.
Now imagine your friend could be hiding in any room on a single floor of a building, but the floor is a maze with lots of directions to go, not a single straight hallway anymore. Now you have more places to potentially look for your friend, because the hiding space is wider and longer. You are dealing with two dimensions in this case (the length and width of the building).
Let’s go a step further. Your friend could be in any room of a massive multi-storey building with numerous rooms on each floor. Now your friend has a lot more possible hiding spots, because you’re not only searching across the length and width of the building, but also up and down its multiple floors. This is an example of a problem with three dimensions (length, width, and height of the building).
The curse of dimensionality makes searching for similarity a complex problem because, as you can see above, every added dimension multiplies the complexity: 10 rooms along a hallway become 100 rooms on a floor and 1,000 rooms in a building.
This is the reason why, if you click play on our visualizations above, the values near your requested query keep changing: we (data scientists) have found ways to search for similar items quickly, but it’s almost impossible to know whether a result is truly the nearest. We are only guessing that it’s the most likely to be the most similar, because the computational resources needed to guarantee the exact nearest neighbor are expensive.
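To make the trade-off concrete, here is a minimal sketch contrasting exact brute-force search (scan the whole vocabulary, always correct, slow) with a crude approximation that only checks a random sample (fast, sometimes wrong). Real systems use much smarter approximate indexes, but the exact/approximate tension is the same:
import random

query_vector = model['king']

def nearest(candidate_words):
    # Brute-force scan: cosine similarity against every candidate - exact but O(N)
    best_word, best_score = None, -1.0
    for w in candidate_words:
        v = model[w]
        score = np.dot(query_vector, v) / (np.linalg.norm(query_vector) * np.linalg.norm(v))
        if w != 'king' and score > best_score:
            best_word, best_score = w, score
    return best_word, best_score

all_words = model.index_to_key             # the full vocabulary: exact, slow
sampled = random.sample(all_words, 5000)   # a random subset: fast, approximate

print('exact:      ', nearest(all_words))
print('approximate:', nearest(sampled))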
This concept is one of the reasons why, when we talk to ChatGPT, we might get different answers to the same query.
Context-aware embedding vs static embedding
The last thing we’ll learn right now for the intuition on word embeddings is the difference between context-aware embeddings and static embeddings.
A context-aware embedding is a word embedding that’s generated per sentence input, for example:
“I love this bat, because its grip is perfect for my baseball match.”
A “bat” can refer to many contexts. A static word embedding might contain several contexts of a “bat” at once, but a context-aware embedding focuses on understanding the whole input first, the whole sentence, and only then gives each word an embedding specific to that sentence. So the “bat” here will refer to a baseball bat, and the model won’t consider other contexts for “bat”.
The example of a static word embedding is GloVe, which we’ve already covered. An example of a context-aware embedding is BERT, which we’ll dive into further in its dedicated section.
#@title Context-aware embedding demo
from transformers import AutoTokenizer, AutoModel
import torch
from scipy.spatial.distance import cosine
import plotly.graph_objects as go

tokenizer = AutoTokenizer.from_pretrained('distilbert-base-uncased', use_fast=True)
bert_model = AutoModel.from_pretrained('distilbert-base-uncased')

def embed_text(text):
    # Encode text
    inputs = tokenizer(text, return_tensors='pt')
    # Compute token embeddings
    outputs = bert_model(**inputs)
    # Mean-pool the token embeddings into a single vector
    embeddings = outputs.last_hidden_state.mean(dim=1).detach().numpy()
    return embeddings

def calculate_similarity(embedding_1, embedding_2):
    # Flatten the embeddings to 1D before comparing with cosine distance
    return 1 - cosine(embedding_1.flatten(), embedding_2.flatten())

def plot_comparison(word, compare_1, compare_2, similarity_1, similarity_2):
    fig = go.Figure(data=[
        go.Bar(name=compare_1, x=[word], y=[similarity_1]),
        go.Bar(name=compare_2, x=[word], y=[similarity_2])
    ])
    # Show the two similarity scores side by side
    fig.update_layout(barmode='group')
    fig.show()

# Your inputs
#@markdown The text input
text = "I love this bat, because its grip is perfect for my baseball match." #@param
#@markdown The word to compare
word = "bat" #@param
#@markdown Contexts to compare
compare_1 = "animal bat" #@param
compare_2 = "baseball bat" #@param

word_embedding = embed_text(text + " " + word)
compare_1_embedding = embed_text(compare_1)
compare_2_embedding = embed_text(compare_2)

similarity_1 = calculate_similarity(word_embedding, compare_1_embedding)
similarity_2 = calculate_similarity(word_embedding, compare_2_embedding)

plot_comparison(word, compare_1, compare_2, similarity_1, similarity_2)
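If the context-aware embedding behaves as described, the “baseball bat” bar should come out noticeably higher than the “animal bat” bar, since DistilBERT reads the whole sentence before producing the embedding. Try swapping the sentence for an animal context (e.g. “A bat flew out of the cave at night.”) and see whether the bars flip.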
Up next: Understanding the math
We’ve already learned how a basic machine learning model without deep learning understands how to classify a sentence, and we’ve seen the lack of “context” understanding when we’re not using a neural network. We’ve now also built the intuition for how word embeddings work. Next up: the math behind them.