Word embeddings are numerical representations of words or phrases in a high-dimensional vector space, where the geometric relationships between vectors capture semantic and syntactic similarities between the corresponding words. These representations enable machine learning models to understand and process natural language in a more meaningful way.
In traditional NLP approaches, words were represented using sparse one-hot encoded vectors, where each word had a unique index in a large vocabulary. However, this representation lacks the ability to capture the relationships and contextual meaning between words. Word embeddings address this limitation by assigning dense, continuous vector representations to words, allowing for more nuanced and contextual word representations.
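To make the contrast concrete, here is a minimal sketch comparing one-hot vectors with dense vectors. The three-word vocabulary and the dense values are hand-picked assumptions for illustration only; they are not learned from any data.

```python
import numpy as np

def cosine(u, v):
    """Cosine similarity between two 1-D vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# Hypothetical three-word vocabulary, for illustration only
toy_vocab = ["goal", "score", "bank"]

# One-hot: one dimension per word, so two different words are always orthogonal
one_hot = np.eye(len(toy_vocab))

# Hand-picked dense vectors (not learned): "goal" and "score" point in similar directions
dense = np.array([
    [0.9, 0.1],  # goal
    [0.8, 0.3],  # score
    [0.1, 0.9],  # bank
])

# Similarity between any two different one-hot vectors is zero
print(cosine(one_hot[0], one_hot[1]))

# Dense vectors can express that "goal" is closer to "score" than to "bank"
print(cosine(dense[0], dense[1]) > cosine(dense[0], dense[2]))
```

With one-hot vectors every pair of distinct words looks equally unrelated, whereas dense vectors leave room for graded similarity.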
Word embeddings are typically learned by training models on large text corpora using techniques like Word2Vec, GloVe, or fastText. These models learn from word co-occurrence patterns in the text and produce embeddings that encode semantic and syntactic relationships. The resulting embeddings reflect linguistic properties such as word similarity and analogies; the classic example is that the vector for 'king' minus 'man' plus 'woman' lies close to the vector for 'queen'.
While pre-trained word embeddings, such as Word2Vec or GloVe, are available online and can be readily used for various NLP tasks, they are generally trained on generic text data and may not capture domain-specific or task-specific nuances. In such cases, training specific word embeddings on specialized datasets can be beneficial. Customized embeddings trained on specific domains or use cases can better capture the intricacies and context of the targeted data, leading to improved performance in downstream NLP tasks.
Domain-specific embeddings allow models to better represent the specialized vocabulary, semantic relationships, and contextual cues present in the data. This can improve results on downstream tasks such as sentiment analysis, text classification, and machine translation.
In this article, we will train our own word embeddings on a small, topic-specific corpus and compare them with pre-trained GloVe embeddings. Training your own embeddings can offer insight into the specific nuances and context of your data.
Case-specific word embeddings
!wget -nc https://lazyprogrammer.me/course_files/nlp/bbc_text_cls.csv
import numpy as np
import pandas as pd
from keras.layers import Embedding, Dense, Input
from keras.models import Model
from keras.preprocessing.text import Tokenizer
# Load the BBC news dataset (the columns used below are 'text' and 'labels')
bbc_text = pd.read_csv('bbc_text_cls.csv')
The dataset contains several categories of news articles. We will work only with the sport category and use just the first 100 articles, to keep the training time short.
# Keep only sport news, since we want topic-specific results
sport_df = bbc_text[bbc_text['labels'] == 'sport']
corpus = sport_df['text']
# Use the first 100 articles for faster training
corpus = corpus[0:100]
Defining the vocabulary and tokenizing the corpus:
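# Tokenize the corpus (by default the Tokenizer lowercases text and strips punctuation)
tokenizer = Tokenizer()
tokenizer.fit_on_texts(corpus)
sequences = tokenizer.texts_to_sequences(corpus)
# Define the vocabulary from the tokenizer, so that vocab[i] is the word with
# integer id i in `sequences`; id 0 is reserved by Keras for padding
vocab = ['<pad>'] + sorted(tokenizer.word_index, key=tokenizer.word_index.get)
vocab_size = len(vocab)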
# Set hyperparameters
embedding_dim = 100
window_size = 2
num_epochs = 100
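The Keras Tokenizer lowercases text and strips punctuation by default, so its vocabulary is not the same as the set of tokens produced by a plain `str.split()`. The sketch below is a simplified pure-Python approximation of that behaviour, not the Keras implementation; the filter string, the helper names (`simple_tokenize`, `build_word_index`), and the toy corpus are assumptions for illustration.

```python
from collections import Counter

# Assumed to match the default `filters` argument of the Keras Tokenizer
KERAS_DEFAULT_FILTERS = '!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~\t\n'

def simple_tokenize(text):
    """Lowercase, replace filtered characters with spaces, then split on spaces."""
    table = str.maketrans({ch: " " for ch in KERAS_DEFAULT_FILTERS})
    return [tok for tok in text.lower().translate(table).split(" ") if tok]

def build_word_index(texts):
    """Map each word to a 1-based id, most frequent first (id 0 is left for padding)."""
    counts = Counter(tok for text in texts for tok in simple_tokenize(text))
    return {word: i for i, (word, _) in enumerate(counts.most_common(), start=1)}

# Hypothetical two-sentence corpus
toy_corpus = ["The 19-year-old breaks the record.", "The record stands."]

toy_word_index = build_word_index(toy_corpus)
whitespace_vocab = {w for text in toy_corpus for w in text.split()}

print(toy_word_index)            # ids assigned by frequency
print(sorted(whitespace_vocab))  # tokens keep their case and punctuation
```

Because the two vocabularies differ, any word list used to look up rows of the embedding matrix should be derived from the tokenizer's `word_index`, not from a separate whitespace split.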
Generating input-output word pairs for the skip-gram model:
# Generate (center word, context word) pairs for skip-gram
word_pairs = []
for sequence in sequences:
    for i in range(len(sequence)):
        for j in range(max(0, i - window_size), min(i + window_size + 1, len(sequence))):
            if i != j:
                word_pairs.append((sequence[i], sequence[j]))
# Prepare input and output data
input_data, output_data = zip(*word_pairs)
input_data = np.array(input_data)
output_data = np.array(output_data)
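To see what the pair generation produces, here is the same windowing logic applied to a tiny sequence. The function name `skipgram_pairs` and the toy ids are assumptions for illustration.

```python
def skipgram_pairs(sequence, window_size):
    """Return (center, context) pairs for every position within the window."""
    pairs = []
    for i, center in enumerate(sequence):
        lo = max(0, i - window_size)
        hi = min(len(sequence), i + window_size + 1)
        for j in range(lo, hi):
            if j != i:  # a word is not its own context
                pairs.append((center, sequence[j]))
    return pairs

# Hypothetical sequence of four word ids
toy_sequence = [4, 7, 2, 9]

print(skipgram_pairs(toy_sequence, window_size=1))
# [(4, 7), (7, 4), (7, 2), (2, 7), (2, 9), (9, 2)]
```

Each word appears as the input once for every neighbour inside its window, so the number of training pairs grows roughly with the sequence length times twice the window size.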
# Define the model architecture
input_layer = Input(shape=(1,))
embedding_layer = Embedding(vocab_size, embedding_dim)(input_layer)
output_layer = Dense(vocab_size, activation='softmax')(embedding_layer)
# Softmax converts the K output scores into a probability distribution over the K vocabulary words, which is what we need to predict the context word
# Define the model
model = Model(inputs=input_layer, outputs=output_layer)
model.compile(loss='sparse_categorical_crossentropy', optimizer='adam')
# Train the model
model.fit(input_data, output_data, epochs=num_epochs)
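The model is trained with a softmax output and a sparse categorical cross-entropy loss. As a rough NumPy sketch of what those two pieces compute (the function names and toy numbers are assumptions, and Keras's own implementation differs in details such as numerical clipping):

```python
import numpy as np

def softmax(logits):
    """Turn each row of scores into a probability distribution."""
    shifted = logits - np.max(logits, axis=-1, keepdims=True)  # for numerical stability
    exp = np.exp(shifted)
    return exp / np.sum(exp, axis=-1, keepdims=True)

def sparse_categorical_crossentropy(probs, target_ids):
    """Negative log-probability assigned to the correct class of each row."""
    rows = np.arange(len(target_ids))
    return -np.log(probs[rows, target_ids])

# Hypothetical scores for two training pairs over a three-word vocabulary
toy_logits = np.array([[2.0, 1.0, 0.1],
                       [0.0, 0.0, 0.0]])
toy_targets = np.array([0, 2])  # integer ids of the true context words

probs = softmax(toy_logits)
losses = sparse_categorical_crossentropy(probs, toy_targets)

print(probs.sum(axis=1))  # every row sums to 1
print(losses)             # lower loss where the true word gets higher probability
```

The "sparse" variant takes integer word ids as targets, which is why `output_data` can stay an array of ids instead of being one-hot encoded.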
# Get the learned word embeddings
weights = model.layers[1].get_weights()[0]
# Create the embedding matrix (row i holds the vector of the word with tokenizer id i)
word_index = tokenizer.word_index
embedding_matrix = np.zeros((vocab_size, embedding_dim))
for word, i in word_index.items():
    if i < vocab_size:
        embedding_matrix[i] = weights[i]
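The word vectors are read directly from the weights of the Embedding layer. A short sketch of why that works: an embedding lookup gives the same result as multiplying a one-hot vector by the weight matrix, so row i of the matrix is the vector for word id i. The sizes and the random weights below are stand-ins, not trained values.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes; the random matrix stands in for trained Embedding weights
toy_vocab_size, toy_dim = 5, 3
toy_weights = rng.normal(size=(toy_vocab_size, toy_dim))

word_id = 3
one_hot = np.zeros(toy_vocab_size)
one_hot[word_id] = 1.0

via_matmul = one_hot @ toy_weights  # one-hot vector times weight matrix
via_lookup = toy_weights[word_id]   # what an embedding lookup returns

print(np.allclose(via_matmul, via_lookup))  # True
```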
Now we are going to use the embedding matrix above to find the words closest to ‘breaks’ and then compare the results with what we get from pre-trained word embeddings.
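# Define a function to find the nearest words to a given word
def find_nearest_words(word, embeddings, vocab, n=5):
    # `vocab` must be ordered so that vocab[i] is the word stored in embeddings[i]
    rows = min(len(vocab), len(embeddings))
    embeddings = embeddings[:rows]
    # Get the index of the word in the vocabulary
    word_index = vocab.index(word)
    # Get the word embedding
    word_embedding = embeddings[word_index]
    # Compute the cosine similarity between the word embedding and all word embeddings
    norms = np.linalg.norm(embeddings, axis=1) * np.linalg.norm(word_embedding)
    zero_rows = norms == 0
    norms[zero_rows] = 1.0  # avoid division by zero for all-zero rows such as padding
    cosine_similarities = np.dot(embeddings, word_embedding) / norms
    # Exclude all-zero rows and the query word itself from the ranking
    cosine_similarities[zero_rows] = -np.inf
    cosine_similarities[word_index] = -np.inf
    # Get the indices of the most similar words
    nearest_indices = cosine_similarities.argsort()[::-1][:n]
    # Get the most similar words
    return [vocab[i] for i in nearest_indices]
# Build a word list aligned with the tokenizer ids (id 0 is reserved for padding)
index_to_word = ['<pad>'] + sorted(tokenizer.word_index, key=tokenizer.word_index.get)
# Find the nearest words to the word 'breaks'
nearest_words = find_nearest_words('breaks', embedding_matrix, index_to_word)
print("Nearest words to 'breaks':", nearest_words)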
Nearest words to 'breaks': ['19-year-old.', 'passing', 'predict', '3:49.78.', 'who']
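One way to sanity-check a nearest-neighbour lookup is to run it on a handful of vectors whose neighbours are known in advance. The sketch below does that with a hypothetical three-word vocabulary and hand-picked vectors; the helper name `nearest` and all values are assumptions for illustration.

```python
import numpy as np

# Hypothetical tokenizer-style mapping: ids start at 1, id 0 is padding
toy_word_index = {"goal": 1, "score": 2, "bank": 3}
index_to_word = ["<pad>"] + sorted(toy_word_index, key=toy_word_index.get)

# Hand-picked vectors (not learned); row i belongs to index_to_word[i]
toy_embeddings = np.array([
    [0.0, 0.0],  # <pad>
    [0.9, 0.1],  # goal
    [0.8, 0.3],  # score
    [0.1, 0.9],  # bank
])

def nearest(word, embeddings, index_to_word, n=2):
    """Rank words by cosine similarity to `word`, skipping padding and the word itself."""
    query_id = index_to_word.index(word)
    norms = np.linalg.norm(embeddings, axis=1)
    norms[norms == 0] = 1.0               # keep all-zero rows from dividing by zero
    unit = embeddings / norms[:, None]    # normalise every row to unit length
    sims = unit @ unit[query_id]          # dot product of unit vectors = cosine
    sims[query_id] = -np.inf              # exclude the query word
    sims[0] = -np.inf                     # exclude the padding row
    order = np.argsort(sims)[::-1][:n]
    return [index_to_word[i] for i in order]

print(nearest("goal", toy_embeddings, index_to_word))  # ['score', 'bank']
```

If a lookup on real embeddings returns neighbours that look arbitrary, a check like this helps separate a bug in the lookup (for example, a word list that is not aligned with the matrix rows) from embeddings that are simply under-trained.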
Pre-trained Embeddings
We are going to use GloVe embeddings, developed at Stanford University.
import gensim.downloader as api
# Load pre-trained word embeddings
embedding_model = api.load("glove-wiki-gigaword-100")
word = "breaks"
nearest_words = embedding_model.most_similar(word)
# Print the nearest words with their similarity scores
print("Nearest words to '{}':".format(word))
for neighbor, similarity in nearest_words:
    print(neighbor, similarity)
Nearest words to 'breaks': ['break', 'gets', 'breaking', 'goes', 'cut']
Results
As observed in the previous example, the model trained on a specific use case, even with a small dataset of just 100 articles, returns associations drawn from the sports domain, whereas the pre-trained model obtained from the internet returns general-purpose ones. For instance, when we input the word ‘breaks’ into our custom-trained model, it returns tokens such as ‘19-year-old.’ and ‘3:49.78.’ (which appears to be a race time), pointing to athletes and record-breaking performances. The pre-trained model, on the other hand, offers generic associations such as ‘break’, ‘gets’, and ‘breaking’. This suggests the advantage of training embeddings tailored to the specific domain or task at hand, as they can capture domain-specific semantics that general-purpose embeddings miss.