
Objective

With the advent of Machine Learning and Natural Language Processing, word embeddings have become a fundamental tool. The first step of the transformer architecture is to tokenize the text and represent the tokens as vectors in a high-dimensional space, which is accomplished by the embedding layer.

I was curious to visualize these vector representations and to explore whether we could train vectors on a small dataset rather than relying on a large pre-trained model. I also wanted to explore the SAP HANA vector engine and use TensorBoard to visualize my own dataset. After multiple attempts, I found a way to satisfy my curiosity by combining the power of the two.

In this blog post, we will explore how to store vector data in SAP HANA DB, visualize the vectors using TensorBoard, and compare the cosine similarity and Euclidean distance results for a word between SAP HANA DB and TensorBoard.

Steps:

  1. Read content from file and generate word embeddings using Word2Vec
  2. Store the embeddings in HANA DB
  3. Visualize the embeddings in TensorBoard projector
  4. Compare the cosine and Euclidean distances of word vectors between HANA DB and TensorBoard

HANA Vector Engine:

To learn more about the HANA vector engine, refer to these blogs:

Let's first create a database table WORD_EMBEDDINGS with a vector column of dimension 100:

CREATE COLUMN TABLE "WORD2VECTOR"."WORD_EMBEDDINGS"(
	"ID" INTEGER,
	"WORD" NVARCHAR(40),
	"VECTOR" REAL_VECTOR(100),
	PRIMARY KEY("ID")
) UNLOAD PRIORITY 5 AUTO MERGE;

The dataset is a text document prepared from https://help.sap.com/docs/sap-ai-core/sap-ai-core-service-guide/what-is-sap-ai-core.  

Generate word embeddings:

The dataset is trained with the Word2Vec model, which is designed to work at the word level within local contexts. The file content is preprocessed into passages or chunks before being tokenized into words. This way we retain the contextual information within each passage while still adhering to the word-level training of Word2Vec. The learned embeddings will still be at the word level, capturing relationships between words based on their co-occurrence.

We will use the Natural Language Toolkit (NLTK) library and its Punkt sentence tokenizer to process the text.

import chardet
import nltk

from gensim.models import Word2Vec
from hdbcli import dbapi
from nltk.tokenize import sent_tokenize, word_tokenize

# Download the Punkt tokenizer data from NLTK.
nltk.download('punkt')

Define the function detect_encoding to read the raw bytes from the file and detect its encoding with chardet.detect.

def detect_encoding(file_path):
    with open(file_path, 'rb') as file:
        raw_data = file.read()
    result = chardet.detect(raw_data)
    return result['encoding']

Define the function preprocess_text to tokenize a sentence into words, convert the words to lowercase, and remove punctuation.

def preprocess_text(sentence):
    words = word_tokenize(sentence)
    # Keep only alphanumeric tokens, lowercased (drops punctuation)
    words = [word.lower() for word in words if word.isalnum()]
    return words
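
As a quick sanity check of the helper (an illustrative sketch; the example sentence is made up), input and output look like this:

# Example: tokenize, lowercase, and drop punctuation tokens
print(preprocess_text("SAP AI Core trains the model."))
# ['sap', 'ai', 'core', 'trains', 'the', 'model']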

Define the function process_file to process the content: the file content is split into sentences using sent_tokenize, and each sentence is tokenized into words using word_tokenize. Splitting the content into smaller units like sentences leads to more accurate and meaningful word embeddings.

def process_file(file_path):
    # Detect the file encoding
    encoding = detect_encoding(file_path)
    
    with open(file_path, 'r', encoding=encoding, errors='ignore') as file:
        content = file.read()
    # Split the content into sentences
    sentences = sent_tokenize(content)
    # Tokenize and preprocess each sentence
    tokenized_sentences = [preprocess_text(sentence) for sentence in sentences if isinstance(sentence, str)]

    return tokenized_sentences

Let's start processing the file content, converting it to lowercase, removing punctuation, and tokenizing the text, by calling the function process_file:

# Path to the file
file_path = "C:/Users/Arun/tensorflow_datasets/Doc/SAPAICore.txt"

# Process the file
tokenized_sentences = process_file(file_path)

The next step is to generate word embeddings using the Word2Vec model. Word2Vec offers two training algorithms, CBOW (the default) and Skip-gram:

  1. Continuous Bag of Words (CBOW): Predicts a target word from the context of surrounding words.
  2. Skip-gram: Predicts the context words from a given target word.

To explicitly choose between CBOW and Skip-gram, use the sg parameter: sg=0 for CBOW (the default), sg=1 for Skip-gram.
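
For illustration, a minimal sketch of a Skip-gram variant (the only change from the training call below is the sg flag):

# Skip-gram variant; all other hyperparameters unchanged
model_sg = Word2Vec(tokenized_sentences, vector_size=100, window=5, min_count=1, workers=4, sg=1)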

# Train Word2Vec model
model = Word2Vec(tokenized_sentences, vector_size=100, window=5, min_count=1, workers=4)

The primary input for Word2Vec is a list of lists of words, where each inner list represents a sentence or a document.

  • tokenized_sentences - the sentences used for training the model
  • vector_size - the dimensionality of the word vectors; set to 100
  • window - the maximum distance between the current and predicted word within a sentence; set to 5
  • min_count - ignores all words with a total frequency lower than this; set to 1
  • workers - the number of worker threads used for training
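
Once training completes, gensim's most_similar offers a quick sanity check of the learned vectors (a sketch, assuming the word 'dataset' occurs in the training text):

# Top 5 nearest neighbours of 'dataset' by cosine similarity
print(model.wv.most_similar('dataset', topn=5))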

Populate the HANA DB:

Connect to the HANA DB and store the word vectors in the table WORD_EMBEDDINGS:

conn = dbapi.connect(
    address='xxxxxxxxxxxx.hana.trial-us10.hanacloud.ondemand.com',
    port='443',
    user='xxxxxx',
    password='xxxxxxxx',
    encrypt=True
    # sslValidateCertificate=True
)
cursor = conn.cursor()
words = list(model.wv.index_to_key)

sql_command = '''INSERT INTO "WORD2VECTOR"."WORD_EMBEDDINGS" ("ID","WORD","VECTOR") VALUES (?,?,TO_REAL_VECTOR(?))'''
for idx, word in enumerate(words):
    vector = model.wv[word]
    # Serialize the vector as '[v1,v2,...]' for TO_REAL_VECTOR
    vector_str = '[' + ','.join(map(str, vector)) + ']'
    cursor.execute(sql_command, (idx, word, vector_str))

conn.commit()
cursor.close()
conn.close()
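
As a side note, for larger vocabularies it may be faster to replace the loop above with a single batched insert via the cursor's executemany (issued before the commit and close); a sketch under the same table layout:

# Build all rows first, then insert them in a single batch
rows = [(idx, word, '[' + ','.join(map(str, model.wv[word])) + ']')
        for idx, word in enumerate(words)]
cursor.executemany(sql_command, rows)
conn.commit()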

Let's check the word embeddings in HANA DB. The table WORD_EMBEDDINGS contains 801 unique words, each with a vector of dimension 100.

[Screenshot: contents of the WORD_EMBEDDINGS table in SAP HANA DB]
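
The count can also be verified from Python over the same connection (a sketch, run before closing the connection):

# Quick verification of the number of stored embeddings
cursor.execute('SELECT COUNT(*) FROM "WORD2VECTOR"."WORD_EMBEDDINGS"')
print(cursor.fetchone()[0])  # expected: 801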

Visualize the Embeddings in TensorBoard:

The next step is to visualize the embeddings in the TensorBoard projector. To do so, the embeddings are exported to a TSV file (embeddings.tsv) and the corresponding metadata (the tokenized words) is saved in another TSV file (metadata.tsv). The embeddings are then loaded into a TensorFlow variable and saved as a checkpoint (embedding.ckpt).

import os
import subprocess

import numpy as np
import tensorflow as tf
from tensorboard.plugins import projector

# Create a directory to store TensorBoard logs
log_dir = 'log_Embeddings'
if not os.path.exists(log_dir):
    os.makedirs(log_dir)

# Save embeddings and metadata
with open(os.path.join(log_dir, 'metadata.tsv'), 'w') as metadata_file:
    for word in model.wv.index_to_key:
        metadata_file.write(f"{word}\n")

embedding_matrix = np.array([model.wv[word] for word in model.wv.index_to_key])
np.savetxt(os.path.join(log_dir, 'embeddings.tsv'), embedding_matrix, delimiter='\t')
# TensorFlow variable for the embeddings
embedding_var = tf.Variable(embedding_matrix, name='word2vec_embeddings')
# Checkpoint to save the embeddings
checkpoint = tf.train.Checkpoint(embedding=embedding_var)
checkpoint.save(os.path.join(log_dir, 'embedding.ckpt'))

The embeddings, metadata, and checkpoint files are written to the log directory:

[Screenshot: embeddings.tsv, metadata.tsv, and checkpoint files in the log directory]

Next, the TensorBoard projector is configured to visualize the embeddings using the metadata file, and TensorBoard is launched via subprocess on the default port 6006.

# Configure the projector
config = projector.ProjectorConfig()
embedding = config.embeddings.add()
embedding.tensor_name = embedding_var.name
embedding.metadata_path = os.path.join(log_dir, 'metadata.tsv')
projector.visualize_embeddings(log_dir, config)
#Launch TensorBoard using subprocess
tb_process = subprocess.Popen(["tensorboard", "--logdir", log_dir])
print("Access TensorBoard at http://localhost:6006/")

An alternative approach is to open the standalone TensorFlow Projector (https://projector.tensorflow.org/) and load the vector file (embeddings.tsv) and the metadata file there. TensorBoard shows the vector representation of 801 tokens in 100 dimensions.

[Animation: TensorBoard projector rendering the 801 tokens in 100 dimensions]

Compare Results:

We have come to the final step: we will search for a word in TensorBoard and in the HANA vector engine, find its nearest tokens, and compare the results. First, let's search TensorBoard for the word 'dataset' using cosine similarity.

[Animation: TensorBoard nearest neighbours of 'dataset' by cosine similarity]

Let's check the same using Euclidean distance.

[Animation: TensorBoard nearest neighbours of 'dataset' by Euclidean distance]

Now, let's run the same search in the HANA vector engine using cosine similarity.

[Screenshot: HANA vector engine cosine similarity results for 'dataset']

Then search for the word 'dataset' using Euclidean distance.

[Screenshot: HANA vector engine Euclidean distance results for 'dataset']
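
For reference, a sketch of how the same searches can be issued from Python. It assumes the HANA Cloud vector engine functions COSINE_SIMILARITY and L2DISTANCE and an open connection; the query vector for 'dataset' is serialized the same way as during the insert:

# Nearest neighbours of 'dataset' by cosine similarity (larger is closer)
query_vec = '[' + ','.join(map(str, model.wv['dataset'])) + ']'
sql = '''SELECT TOP 10 "WORD",
                COSINE_SIMILARITY("VECTOR", TO_REAL_VECTOR(?)) AS "SCORE"
         FROM "WORD2VECTOR"."WORD_EMBEDDINGS"
         ORDER BY "SCORE" DESC'''
cursor.execute(sql, (query_vec,))
print(cursor.fetchall())

# The same search by Euclidean (L2) distance (smaller is closer)
sql = '''SELECT TOP 10 "WORD",
                L2DISTANCE("VECTOR", TO_REAL_VECTOR(?)) AS "DIST"
         FROM "WORD2VECTOR"."WORD_EMBEDDINGS"
         ORDER BY "DIST" ASC'''
cursor.execute(sql, (query_vec,))
print(cursor.fetchall())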

Nearest points like 'alongside', 'additionally', 'xusaa', 'vs', and 'operation' appear in both TensorBoard and the HANA vector engine, though slight differences are visible. These are likely due to small variations in how cosine similarity and Euclidean distance are implemented in TensorBoard and SAP HANA DB; for example, numerical precision, handling of zero vectors, or normalization methods could differ.
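
To see where such small discrepancies can come from, both metrics can also be computed locally from the trained model with numpy (a sketch; both words are assumed to be in the vocabulary):

import numpy as np

v1 = model.wv['dataset']
v2 = model.wv['operation']

# Cosine similarity: dot product of the normalized vectors
cosine = np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2))
# Euclidean (L2) distance
euclidean = np.linalg.norm(v1 - v2)
print(cosine, euclidean)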

Conclusion:

In this blog post, I've demonstrated how to generate word embeddings using Word2Vec, store these embeddings in SAP HANA DB, visualize them using TensorBoard, and compare the cosine similarity and Euclidean distance of word vectors between the SAP HANA vector engine and TensorBoard. This process provides valuable insights into the semantic relationships captured by word embeddings and showcases the capabilities of SAP HANA DB and TensorBoard for handling and visualizing vector data.

Note: When importing Word2Vec from gensim, you might encounter the error "ImportError: cannot import name 'triu' from 'scipy.linalg'". To fix it, open matutils.py (stored at \AppData\Local\Programs\Python\Python312\Lib\site-packages\gensim\matutils.py) and import triu from numpy instead of scipy.linalg. Alternatively, pinning scipy to a version below 1.13 (which removed triu from scipy.linalg) should avoid editing library code.

Regards,

ArunKumar Balakrishnan