
Objective
With the advent of machine learning and natural language processing, word embeddings have become a fundamental tool. The first step in the transformer architecture is to tokenize the input text and represent the tokens as vectors in a high-dimensional space, which is accomplished by the embedding layer.
I was curious to visualize these vector representations and to explore whether embeddings could be trained on a small dataset rather than taken from a large pre-trained model. I also wanted to explore the SAP HANA vector engine and use TensorBoard to visualize my own dataset. After multiple attempts, I found a way to satisfy this curiosity by combining the SAP HANA vector engine and TensorBoard.
In this blog post, we will explore how to store vector data in SAP HANA DB, visualize the vectors using TensorBoard, and compare the cosine-similarity and Euclidean-distance results for a word between SAP HANA DB and TensorBoard.
Steps:
HANA Vector Engine:
To learn more about the HANA Vector Engine, refer to the SAP community blogs on the topic.
Let's first create a database table WORD_EMBEDDINGS with a vector column of dimension 100:
CREATE COLUMN TABLE "WORD2VECTOR"."WORD_EMBEDDINGS" (
    "ID" INTEGER,
    "WORD" NVARCHAR(40),
    "VECTOR" REAL_VECTOR(100),
    PRIMARY KEY ("ID")
) UNLOAD PRIORITY 5 AUTO MERGE;
The dataset is a text document prepared from https://help.sap.com/docs/sap-ai-core/sap-ai-core-service-guide/what-is-sap-ai-core.
Generate word embeddings:
A Word2Vec model, which is designed to work at the word level within local contexts, is trained on the dataset. The file content is preprocessed into passages or chunks before being tokenized into words. This way we retain the contextual information within each passage while still adhering to the word-level training of Word2Vec. The learned embeddings are still at the word level, capturing the relationships between words based on their co-occurrence.
We will use the Natural Language Toolkit (NLTK) and its Punkt sentence tokenizer to process the text.
import nltk
import chardet
from gensim.models import Word2Vec
from nltk.tokenize import sent_tokenize, word_tokenize
from hdbcli import dbapi
# Download the Punkt tokenizer from NLTK.
nltk.download('punkt')
Define the function detect_encoding to read the raw bytes from the file and use chardet.detect to detect the file's encoding:
def detect_encoding(file_path):
    # Read the raw bytes and let chardet infer the encoding.
    with open(file_path, 'rb') as file:
        raw_data = file.read()
    result = chardet.detect(raw_data)
    return result['encoding']
Define the function preprocess_text to tokenize a sentence into words, convert them to lowercase, and remove punctuation:
def preprocess_text(sentence):
    words = word_tokenize(sentence)
    words = [word.lower() for word in words if word.isalnum()]
    return words
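For illustration, here is what the preprocessing produces for a sample sentence (the input text is made up; the output follows from word_tokenize plus the isalnum filter):
print(preprocess_text("SAP AI Core is a service in the SAP BTP!"))
# -> ['sap', 'ai', 'core', 'is', 'a', 'service', 'in', 'the', 'sap', 'btp']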
Define the function process_file to process the file content. The content is split into sentences using sent_tokenize, and each sentence is tokenized into words using word_tokenize. Splitting the file content into smaller units such as sentences leads to more accurate and meaningful word embeddings.
def process_file(file_path):
    # Detect the file encoding
    encoding = detect_encoding(file_path)
    with open(file_path, 'r', encoding=encoding, errors='ignore') as file:
        content = file.read()
    # Split the content into sentences
    sentences = sent_tokenize(content)
    # Tokenize and preprocess each sentence
    tokenized_sentences = [preprocess_text(sentence) for sentence in sentences if isinstance(sentence, str)]
    return tokenized_sentences
Let's process the file by calling process_file, which lowercases the text, removes punctuation, and tokenizes it:
# Path to the file
file_path = "C:/Users/Arun/tensorflow_datasets/Doc/SAPAICore.txt"
# Process the file
tokenized_sentences = process_file(file_path)
The next step is to generate the word embeddings using the Word2Vec model. Word2Vec offers two training algorithms: CBOW (the default) and Skip-gram.
To choose between them explicitly, use the sg parameter: sg=0 for CBOW (the default), sg=1 for Skip-gram.
# Train Word2Vec model
model = Word2Vec(tokenized_sentences, vector_size=100, window=5, min_count=1, workers=4)
The primary input for Word2Vec is a list of lists of words. Each inner list represents a sentence or a document.
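As an illustration, switching the same training run to Skip-gram is a one-parameter change (sg_model is just an example name, and the word passed to most_similar must exist in the vocabulary):
# Skip-gram variant (sg=1); all other hyperparameters unchanged
sg_model = Word2Vec(tokenized_sentences, vector_size=100, window=5, min_count=1, workers=4, sg=1)
# Quick sanity check: nearest neighbours of a vocabulary word
print(sg_model.wv.most_similar('core', topn=5))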
Populate the HANA DB:
Connect to HANA DB and store the word vectors in the table WORD_EMBEDDINGS:
conn = dbapi.connect(
    address='xxxxxxxxxxxx.hana.trial-us10.hanacloud.ondemand.com',
    port=443,
    user='xxxxxx',
    password='xxxxxxxx',
    encrypt=True
    # sslValidateCertificate=True
)
cursor = conn.cursor()
words = list(model.wv.index_to_key)
for idx, word in enumerate(words):
    vector = model.wv[word]
    # Serialize the vector as '[v1,v2,...]' for TO_REAL_VECTOR
    vector_str = '[' + ','.join(map(str, vector)) + ']'
    sql_command = '''INSERT INTO "WORD2VECTOR"."WORD_EMBEDDINGS" ("ID","WORD","VECTOR") VALUES (?,?,TO_REAL_VECTOR(?))'''
    cursor.execute(sql_command, (idx, word, vector_str))
conn.commit()
cursor.close()
conn.close()
Let's check the word embeddings in HANA DB. Table WORD_EMBEDDINGS contains 801 unique words, each with a vector of dimension 100.
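To verify the load, a quick check like the following can be run in the SQL console (a sketch; TO_NVARCHAR renders the REAL_VECTOR column as readable text in SAP HANA Cloud):
-- Count the stored words
SELECT COUNT(*) FROM "WORD2VECTOR"."WORD_EMBEDDINGS";
-- Inspect a few rows
SELECT "ID", "WORD", TO_NVARCHAR("VECTOR") FROM "WORD2VECTOR"."WORD_EMBEDDINGS" LIMIT 5;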
Visualize the Embeddings in TensorBoard:
The next step is to visualize the embeddings in the TensorBoard projector. To do so, the embeddings are exported to a TSV file (embeddings.tsv) and the corresponding metadata (the tokenized words) is saved to another TSV file (metadata.tsv). The embeddings are then loaded into a TensorFlow variable and saved as a checkpoint (embedding.ckpt).
import os
import subprocess
import numpy as np
import tensorflow as tf
from tensorboard.plugins import projector
# Create a directory to store TensorBoard logs
log_dir = 'log_Embeddings'
if not os.path.exists(log_dir):
    os.makedirs(log_dir)
# Save embeddings and metadata
with open(os.path.join(log_dir, 'metadata.tsv'), 'w') as metadata_file:
    for word in model.wv.index_to_key:
        metadata_file.write(f"{word}\n")
embedding_matrix = np.array([model.wv[word] for word in model.wv.index_to_key])
np.savetxt(os.path.join(log_dir, 'embeddings.tsv'), embedding_matrix, delimiter='\t')
# TensorFlow variable for the embeddings
embedding_var = tf.Variable(embedding_matrix, name='word2vec_embeddings')
# Checkpoint to save the embeddings
checkpoint = tf.train.Checkpoint(embedding=embedding_var)
checkpoint.save(os.path.join(log_dir, 'embedding.ckpt'))
The embeddings, metadata, and checkpoint files are written to the log directory.
Next, configure the TensorBoard projector to visualize the embeddings using the metadata file. TensorBoard is then launched via subprocess and serves the visualization on the default port 6006.
# Configure the projector
config = projector.ProjectorConfig()
embedding = config.embeddings.add()
embedding.tensor_name = embedding_var.name
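# Note: if the projector reports that the tensor cannot be found, the key saved
# by tf.train.Checkpoint is usually "embedding/.ATTRIBUTES/VARIABLE_VALUE";
# set embedding.tensor_name to that value instead of the variable name.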
embedding.metadata_path = os.path.join(log_dir, 'metadata.tsv')
projector.visualize_embeddings(log_dir, config)
# Launch TensorBoard using subprocess
tb_process = subprocess.Popen(["tensorboard", "--logdir", log_dir])
print("Access TensorBoard at http://localhost:6006/")
An alternative is to open the standalone TensorFlow Embedding Projector (projector.tensorflow.org) and load the vectors file (embeddings.tsv) and the metadata file there. TensorBoard shows the vector representation of the 801 tokens in 100 dimensions.
Compare Results:
We have come to the final step: we will search for a word in TensorBoard and in the HANA Vector Engine, find its nearest tokens, and compare the results. First, let's search TensorBoard for the word 'dataset' using cosine similarity.
Let's check the same using Euclidean distance.
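The same neighbour lists can be cross-checked in code directly from the trained model; the sketch below assumes 'dataset' is in the vocabulary:
# Cosine similarity: gensim's most_similar ranks by cosine similarity
print(model.wv.most_similar('dataset', topn=5))
# Euclidean distance: computed manually over the whole vocabulary
query = model.wv['dataset']
dists = np.linalg.norm(model.wv.vectors - query, axis=1)
nearest = np.argsort(dists)[1:6]  # index 0 is the word itself
print([(model.wv.index_to_key[i], float(dists[i])) for i in nearest])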
Now, let's run the same search in the HANA Vector Engine using cosine similarity.
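A sketch of this search in SQL, using the COSINE_SIMILARITY function of the SAP HANA Cloud vector engine (the subquery fetches the stored vector for 'dataset'):
SELECT TOP 5 "WORD",
    COSINE_SIMILARITY("VECTOR", (SELECT "VECTOR" FROM "WORD2VECTOR"."WORD_EMBEDDINGS" WHERE "WORD" = 'dataset')) AS "SIMILARITY"
FROM "WORD2VECTOR"."WORD_EMBEDDINGS"
WHERE "WORD" <> 'dataset'
ORDER BY "SIMILARITY" DESC;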
Then search for the word 'dataset' using Euclidean distance.
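The Euclidean variant is analogous, using the L2DISTANCE function and ascending order, since a smaller distance means a closer neighbour:
SELECT TOP 5 "WORD",
    L2DISTANCE("VECTOR", (SELECT "VECTOR" FROM "WORD2VECTOR"."WORD_EMBEDDINGS" WHERE "WORD" = 'dataset')) AS "DISTANCE"
FROM "WORD2VECTOR"."WORD_EMBEDDINGS"
WHERE "WORD" <> 'dataset'
ORDER BY "DISTANCE" ASC;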
Nearest points such as 'alongside', 'additionally', 'xusaa', 'vs', and 'operation' appear in both TensorBoard and the HANA Vector Engine, though there are slight differences. These are likely due to minor variations in how TensorBoard and SAP HANA DB implement cosine similarity and Euclidean distance; for example, numerical precision, handling of zero vectors, or normalization methods could differ.
Conclusion:
In this blog post, I've demonstrated how to generate word embeddings using Word2Vec, store them in SAP HANA DB, visualize them using TensorBoard, and compare the cosine-similarity and Euclidean-distance results for word vectors between the SAP HANA Vector Engine and TensorBoard. This process provides valuable insight into the semantic relationships captured by word embeddings and showcases the capabilities of SAP HANA DB and TensorBoard for handling and visualizing vector data.
Note: When importing Word2Vec from gensim you might encounter the error "ImportError: cannot import name 'triu' from 'scipy.linalg'", because triu was removed from scipy.linalg in recent SciPy releases. To fix it, open matutils.py (stored at \AppData\Local\Programs\Python\Python312\Lib\site-packages\gensim\matutils.py) and import triu from numpy instead of scipy.linalg.
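For reference, the change in matutils.py amounts to swapping one import; the exact original line may differ slightly by gensim version:
# Before (fails on SciPy >= 1.13, where triu was removed):
# from scipy.linalg import get_blas_funcs, triu
# After:
from scipy.linalg import get_blas_funcs
from numpy import triu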
Regards,
ArunKumar Balakrishnan