
This blog covers the following objectives: creating an embedding, storing it in a vector database, and reading it back by similarity.
Overview of the setup and flow of the blog:
I. Create Embeddings
Before we create an embedding, let's understand what an embedding is and which models are available.
Embeddings are numerical representations of concepts converted to number sequences, which make it easy for computers to understand the relationships between those concepts.
Some of the models that are available with OpenAI are:
Operations | Models | Use Cases |
---|---|---|
Text similarity: captures semantic similarity between pieces of text. | text-similarity-{ada, babbage, curie, davinci}-001 | Clustering, regression, anomaly detection, visualization |
Text search: semantic information retrieval over documents. | text-search-{ada, babbage, curie, davinci}-{query, doc}-001 | Search, context relevance, information retrieval |
Code search: find relevant code with a query in natural language. | code-search-{ada, babbage}-{code, text}-001 | Code search and relevance |
Setup for creating an embedding from Visual Studio Code
Step 1: Create an account at https://openai.com/
Step 2: Select API
Step 3: Create an API Key
Step 4: Install visual studio code from https://code.visualstudio.com/download
Step 5: Install the Postman extension in Visual Studio Code and set up the API key
Step 6: Set up and execute the API call to create the embedding
The embedding is returned as below, consuming 4 tokens:
{
  "object": "list",
  "data": [
    {
      "object": "embedding",
      "index": 0,
      "embedding": [
        0.006846516,
        -0.017028954,
        -0.033345513,
        -0.03129053.........
      ]
    }
  ],
  "model": "text-embedding-ada-002",
  "usage": {
    "prompt_tokens": 4,
    "total_tokens": 4
  }
}
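The same request made from Postman can be scripted. Below is a minimal Python sketch using only the standard library, assuming the text-embedding-ada-002 model and an API key passed in as an argument; the `extract_vector` helper is a hypothetical name for pulling the vector out of a response shaped like the JSON above.

```python
import json
import urllib.request

API_URL = "https://api.openai.com/v1/embeddings"

def extract_vector(payload):
    """Pull the embedding vector out of a response shaped like the JSON above."""
    return payload["data"][0]["embedding"]

def create_embedding(text, api_key):
    """Request an embedding for `text` from the OpenAI API (makes a network call)."""
    body = json.dumps({"model": "text-embedding-ada-002", "input": text}).encode()
    request = urllib.request.Request(
        API_URL,
        data=body,
        headers={
            "Authorization": "Bearer " + api_key,
            "Content-Type": "application/json",
        },
    )
    with urllib.request.urlopen(request) as response:
        return extract_vector(json.load(response))
```

Calling `create_embedding("Hello world", api_key)` should return a list of floats like the `embedding` array shown above.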
II. Store the Embedding
Embeddings need a special type of database, called a vector database, for storing the vectors. I have used SingleStore; there are many other vector databases, such as Pinecone, and SAP has just announced its own SAP HANA Cloud vector engine. We will subscribe to SingleStore, understand some of its components, create a vector database and a table, and store an embedding.
The setup to store an embedding in the SingleStore vector database:
Step 1: Sign up for SingleStore
Step 2: Attach your workspace and objects. By default, SingleStore suspends all objects after 20 minutes of inactivity, so you must attach (activate) them before performing operations.
Step 3: Hierarchy and objects for storing the embedding
Step 4: Insert the text and its embedding into the vector database using the SQL editor in the development pane
Congratulations, we can now perform various operations on the vectors. Explore whether King - Man + Woman = Queen!
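The famous analogy can be sketched in a few lines of Python. The 3-dimensional vectors below are invented purely for illustration (real ada-002 embeddings have 1536 dimensions), and `closest` is a hypothetical nearest-neighbour helper:

```python
import math

# Toy 3-d vectors invented for illustration; real embeddings are ~1536-d.
vectors = {
    "king":  [0.9, 0.8, 0.1],
    "man":   [0.5, 0.2, 0.1],
    "woman": [0.5, 0.2, 0.9],
    "queen": [0.9, 0.8, 0.9],
}

def add(a, b):
    return [x + y for x, y in zip(a, b)]

def sub(a, b):
    return [x - y for x, y in zip(a, b)]

def closest(v, vocab):
    """Return the word whose vector is nearest (Euclidean) to v."""
    return min(vocab, key=lambda word: math.dist(v, vocab[word]))

# king - man + woman lands next to queen in this toy space
result = add(sub(vectors["king"], vectors["man"]), vectors["woman"])
print(closest(result, vectors))  # -> queen
```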
III. Reading the embedding
Unlike in a relational database, where we search for an exact text match (such as an employee PERNR or a material MATNR), with embeddings we search for similarity between vectors, because vectors with similar meanings cluster together in the vector database.
To compute the similarity between vectors, we need a few mathematical models. We will look at two widely used measures:
Cosine similarity / dot product: the cosine of the angle formed by the vectors at the origin defines the similarity score (the dot product is the unnormalized variant of the same idea)
Assume there are two vectors P(1) with two words – “Hello World” and P(2) with one word “Hello”.
Step 1: Tabulate the number of times each of these words occurs in the phrases.
Step 2: Plot the first point, x, for P(1) at x(1,1)
Step 3: Plot the second point, y, for P(2) at y(0,1)
Step 4: Extend lines from the origin (0,0) through the points x and y
Step 5: Measure the angle formed by these rays, which is 45°
Step 6: The cosine of the angle is 0.71
Note: if the vectors are multidimensional, the same formula, cos(θ) = (A · B) / (|A| |B|), is used to compute the similarity value.
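The steps above can be reproduced in a few lines of Python; the same formula generalizes unchanged to 1536-dimensional embedding vectors:

```python
import math

def cosine_similarity(a, b):
    """cos(theta) = (a . b) / (|a| * |b|), valid for any number of dimensions."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))

p1 = [1, 1]  # word counts for P(1) "Hello World"
p2 = [0, 1]  # word counts for P(2) "Hello"
print(round(cosine_similarity(p1, p2), 2))  # -> 0.71, the cosine of 45°
```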
Euclidean distance: Where the distance between the vectors defines the similarity score
Assume there are two vectors P(1) with two words – “Hello World” and P(2) with one word “Hello”.
Step 1: Tabulate the number of times each of these words occurs in the phrases.
Step 2: Plot the first point, x, for P(1) at x(1,1)
Step 3: Plot the second point, y, for P(2) at y(0,1)
Step 4: Measure the distance between the two vectors with the Euclidean formula, d = sqrt((x1 - y1)² + (x2 - y2)²)
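This measure, too, is a one-liner that works for vectors of any dimension; for the two points above the distance comes out to 1.0:

```python
import math

def euclidean_distance(a, b):
    """sqrt(sum((a_i - b_i)^2)) -- a smaller distance means more similar vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

print(euclidean_distance([1, 1], [0, 1]))  # -> 1.0
```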
Now that you understand the concepts behind the similarity value, let's set up the system.
Step 1: The table Myvectortable in the Openaidatabase has been populated with the embeddings for the texts below
Step 2: Compute the similarity values for the vectors "Hi" and "Cya" using the dot_product approach
The words "Hi" and "Cya" are not stored in Myvectortable. Their embeddings are computed with the OpenAI API, as in the embedding-creation step, and each embedding is then passed as a parameter to the query below that reads from Myvectortable:
The results of both queries are:
The embedding for "Hi" returns a higher similarity value for "Hello world", while the embedding for "Cya" returns a higher similarity value for "Bye Bye".
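What the SingleStore dot_product query does can be mimicked in plain Python. The in-memory table and 2-d vectors below are made up for illustration, standing in for Myvectortable and real 1536-d embeddings:

```python
# Toy stand-in for Myvectortable: made-up 2-d vectors instead of real embeddings.
table = {
    "Hello world": [0.9, 0.1],
    "Bye Bye":     [0.1, 0.9],
}

def dot_product(a, b):
    return sum(x * y for x, y in zip(a, b))

def most_similar(query_vector, rows):
    """Rank stored texts by dot product with the query vector, best match first."""
    return sorted(rows, key=lambda t: dot_product(query_vector, rows[t]), reverse=True)

hi_vector = [0.8, 0.2]   # pretend embedding for "Hi"
cya_vector = [0.2, 0.8]  # pretend embedding for "Cya"

print(most_similar(hi_vector, table)[0])   # -> Hello world
print(most_similar(cya_vector, table)[0])  # -> Bye Bye
```

As in the query results above, "Hi" lands closest to "Hello world" and "Cya" closest to "Bye Bye".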
Conclusion
“I am still learning.”—Michelangelo
I hope this blog taps into your curious mind and encourages you to be part of the AI journey. In the big picture with reference to SAP, this blog falls within the scope of the vector engine.
Source: SAP