
This blog covers the following objectives: creating an embedding, storing it in a vector database, and reading it back by similarity.
Overview of the setup and flow of the blog:
I. Create Embeddings
Before we create an embedding, let's understand what an embedding is and which models are available.
Embeddings are numerical representations of concepts converted to number sequences, which make it easy for computers to understand the relationships between those concepts.
Some of the models that are available with OpenAI are:
Operations | Models | Use Cases |
---|---|---|
Text similarity: captures semantic similarity between pieces of text. | text-similarity-{ada, babbage, curie, davinci}-001 | Clustering, regression, anomaly detection, visualization |
Text search: semantic information retrieval over documents. | text-search-{ada, babbage, curie, davinci}-{query, doc}-001 | Search, context relevance, information retrieval |
Code search: find relevant code with a query in natural language. | code-search-{ada, babbage}-{code, text}-001 | Code search and relevance |
Setup for creating an embedding from Visual Studio Code
Step 1: Create an account at https://openai.com/
Step 2: Select API
Step 3: Create an API Key
Step 4: Install visual studio code from https://code.visualstudio.com/download
Step 5: Install the Postman extension in Visual Studio Code and set up the API key
Step 6: Set up and execute the API call to create the embedding
The embedding is returned as below, consuming 4 tokens:
{
  "object": "list",
  "data": [
    {
      "object": "embedding",
      "index": 0,
      "embedding": [
        0.006846516,
        -0.017028954,
        -0.033345513,
        -0.03129053.........
      ]
    }
  ],
  "model": "text-embedding-ada-002",
  "usage": {
    "prompt_tokens": 4,
    "total_tokens": 4
  }
}
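The same request made from Postman can be scripted. Below is a minimal Python sketch using only the standard library, assuming the text-embedding-ada-002 model and an API key passed in as an argument; the `extract_vector` helper is a hypothetical name for pulling the vector out of a response shaped like the JSON above.

```python
import json
import urllib.request

API_URL = "https://api.openai.com/v1/embeddings"

def extract_vector(payload):
    """Pull the embedding vector out of a response shaped like the JSON above."""
    return payload["data"][0]["embedding"]

def create_embedding(text, api_key):
    """Request an embedding for `text` from the OpenAI API (makes a network call)."""
    body = json.dumps({"model": "text-embedding-ada-002", "input": text}).encode()
    request = urllib.request.Request(
        API_URL,
        data=body,
        headers={
            "Authorization": "Bearer " + api_key,
            "Content-Type": "application/json",
        },
    )
    with urllib.request.urlopen(request) as response:
        return extract_vector(json.load(response))
```

Calling `create_embedding("Hello world", api_key)` should return a list of floats like the `embedding` array shown above.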
II. Store the Embedding
Embeddings need a special type of database, called a vector database, for storing the vectors. I have used SingleStore; there are many other vector databases, such as Pinecone, and SAP has just announced its own SAP HANA Cloud vector engine. We will subscribe to SingleStore, understand some of its components, create a vector database and a table, and store an embedding.
The setup to store an embedding in the SingleStore vector database:
Step 1: Sign up for SingleStore
Step 2: Attach your workspace and objects. By default, SingleStore suspends all objects after 20 minutes of inactivity, so you must attach (activate) them before performing operations.
Step 3: Hierarchy and objects for storing the embedding
Step 4: Insert the text and its embedding into the vector database using the SQL editor in the development pane
Congratulations, we can now perform various operations on the vectors. Explore whether King - Man + Woman = Queen!
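The famous analogy can be sketched in a few lines of Python. The 3-dimensional vectors below are invented purely for illustration (real ada-002 embeddings have 1536 dimensions), and `closest` is a hypothetical nearest-neighbour helper:

```python
import math

# Toy 3-d vectors invented for illustration; real embeddings are ~1536-d.
vectors = {
    "king":  [0.9, 0.8, 0.1],
    "man":   [0.5, 0.2, 0.1],
    "woman": [0.5, 0.2, 0.9],
    "queen": [0.9, 0.8, 0.9],
}

def add(a, b):
    return [x + y for x, y in zip(a, b)]

def sub(a, b):
    return [x - y for x, y in zip(a, b)]

def closest(v, vocab):
    """Return the word whose vector is nearest (Euclidean) to v."""
    return min(vocab, key=lambda word: math.dist(v, vocab[word]))

# king - man + woman lands next to queen in this toy space
result = add(sub(vectors["king"], vectors["man"]), vectors["woman"])
print(closest(result, vectors))  # -> queen
```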
III. Reading the embedding
Unlike in a relational database, where we search for an exact text match (such as an employee PERNR or a material MATNR), with embeddings we search for similarity between vectors, because vectors with similar meanings cluster together in the vector database.
To compute the similarity between vectors, we need a few mathematical models. We will look at two widely used measures:
Cosine similarity / dot product: the cosine of the angle formed by the vectors at the origin defines the similarity score (the dot product is the unnormalized variant of the same idea)
Assume there are two vectors P(1) with two words – “Hello World” and P(2) with one word “Hello”.
Step 1: Tabulate the number of times each of these words occurs in the phrases.
Step 2: Plot the first point, x, for P(1) at x(1,1)
Step 3: Plot the second point, y, for P(2) at y(0,1)
Step 4: Extend lines from the origin (0,0) through the points x and y
Step 5: Measure the angle formed by these rays, which is 45°
Step 6: The cosine of the angle is 0.71
Note: if the vectors are multidimensional, the same formula, cos(θ) = (A · B) / (|A| |B|), is used to compute the similarity value.
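The steps above can be reproduced in a few lines of Python; the same formula generalizes unchanged to 1536-dimensional embedding vectors:

```python
import math

def cosine_similarity(a, b):
    """cos(theta) = (a . b) / (|a| * |b|), valid for any number of dimensions."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))

p1 = [1, 1]  # word counts for P(1) "Hello World"
p2 = [0, 1]  # word counts for P(2) "Hello"
print(round(cosine_similarity(p1, p2), 2))  # -> 0.71, the cosine of 45°
```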
Euclidean distance: Where the distance between the vectors defines the similarity score
Assume there are two vectors P(1) with two words – “Hello World” and P(2) with one word “Hello”.
Step 1: Tabulate the number of times each of these words occurs in the phrases.
Step 2: Plot the first point, x, for P(1) at x(1,1)
Step 3: Plot the second point, y, for P(2) at y(0,1)
Step 4: Measure the distance between the two vectors with the Euclidean formula, d = sqrt((x1 - y1)² + (x2 - y2)²)
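This measure, too, is a one-liner that works for vectors of any dimension; for the two points above the distance comes out to 1.0:

```python
import math

def euclidean_distance(a, b):
    """sqrt(sum((a_i - b_i)^2)) -- a smaller distance means more similar vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

print(euclidean_distance([1, 1], [0, 1]))  # -> 1.0
```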
Now that you understand the concepts behind the similarity value, let's set up the system.
Step 1: The table Myvectortable in the Openaidatabase has been populated with the embeddings for the texts below
Step 2: Compute the similarity values for the vectors "Hi" and "Cya" using the dot_product approach
The words "Hi" and "Cya" are not stored in Myvectortable. Their embeddings are computed with the OpenAI API, as in the embedding-creation step, and each embedding is then passed as a parameter to the query below that reads from Myvectortable:
The results of both queries are:
The embedding for "Hi" returns a higher similarity value for "Hello world", while the embedding for "Cya" returns a higher similarity value for "Bye Bye".
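What the SingleStore dot_product query does can be mimicked in plain Python. The in-memory table and 2-d vectors below are made up for illustration, standing in for Myvectortable and real 1536-d embeddings:

```python
# Toy stand-in for Myvectortable: made-up 2-d vectors instead of real embeddings.
table = {
    "Hello world": [0.9, 0.1],
    "Bye Bye":     [0.1, 0.9],
}

def dot_product(a, b):
    return sum(x * y for x, y in zip(a, b))

def most_similar(query_vector, rows):
    """Rank stored texts by dot product with the query vector, best match first."""
    return sorted(rows, key=lambda t: dot_product(query_vector, rows[t]), reverse=True)

hi_vector = [0.8, 0.2]   # pretend embedding for "Hi"
cya_vector = [0.2, 0.8]  # pretend embedding for "Cya"

print(most_similar(hi_vector, table)[0])   # -> Hello world
print(most_similar(cya_vector, table)[0])  # -> Bye Bye
```

As in the query results above, "Hi" lands closest to "Hello world" and "Cya" closest to "Bye Bye".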
Conclusion
“I am still learning.”—Michelangelo
I hope this blog taps into your curious mind and encourages you to be part of the AI journey. In the big picture with reference to SAP, this blog falls within the scope of the vector engine.
Source: SAP