
A practitioner's view on embeddings... with a touch of SAP BTP!
I'm writing these lines to save you time in understanding what embeddings are, what they are used for, and how to pick 'the right' embedding for your application with performance, data security and latency in mind. Over the past year I have come across this topic so many times that I want to give a braindump of practical findings. Let me know your thoughts and suggestions below.
Generative AI refers to a class of artificial intelligence that specializes in creating new content, whether that be text, images, sounds, or even video. Unlike discriminative models that classify or predict based on input data, generative models can generate novel data samples. Applications of generative AI are vast and include tasks such as synthesizing realistic human speech, generating art or music, designing new drugs, and creating virtual environments for gaming and simulations.
At the heart of generative AI's ability to produce new content are embeddings. Embeddings are dense, low-dimensional, and continuous vector representations of high-dimensional data. They are the foundation upon which generative models understand and manipulate data. For example, in natural language processing (NLP), words, sentences, or entire documents are converted into vectors that capture semantic meaning and context. In image processing, embeddings might represent key features of an image that allow a model to generate similar but unique images.
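To make that concrete, here is a minimal sketch of the 'text in, dense vector out' idea. It assumes the SentenceTransformers library (which appears again later in this post) and uses one of the models from my comparison table further below:

```python
# Minimal sketch: text in, fixed-size dense vector out.
# Assumes: pip install sentence-transformers
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")

# Any text, short or long, is mapped to one dense vector.
vector = model.encode("Generative AI creates new content from learned patterns.")
print(vector.shape)  # (384,) - this particular model produces 384 floats
```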
The role of embeddings in generative AI cannot be overstated. They serve as a bridge between the raw, often unstructured data and the sophisticated neural networks that process them. By translating data into a format that AI models can efficiently work with, embeddings enable models to discern patterns, make associations, and ultimately generate new content that is coherent and contextually relevant.
The importance of embeddings lies in their ability to capture the essence of the data. For text, this means understanding synonyms, analogies, and the subtleties of language. For images, it involves recognizing shapes, textures, and colors. This transformation of raw data into a meaningful vector space is what allows generative AI to be creative and insightful, pushing the boundaries of what machines can produce.
In the following sections, we will delve deeper into the types of embeddings, their applications in various domains of generative AI, and the critical considerations one must make when choosing the right embedding model for a given task.
There are quite a few types of embeddings to distinguish: word embeddings, sentence embeddings, image embeddings, audio embeddings and, more recently, multimodal embeddings (e.g. combining image and text). For this blog I'd like to focus on word and sentence embeddings.
Word embeddings are vector representations of individual words. They capture the semantic meaning of words by placing semantically similar words close to each other in the embedding space. Word embeddings are typically used for:
Word-level tasks: You are working on tasks that require understanding or processing at the word level, such as part-of-speech tagging, named entity recognition, or word sense disambiguation.
Sentence embeddings, on the other hand, are vector representations of entire sentences or phrases. They are designed to capture the meaning of the sentence as a whole, taking into account word order and the interactions between words. Sentence embeddings are typically used for sentence-level tasks such as semantic search, clustering, or text classification.
So when should you use which? Well, you can read about that in the literature, but what I found is that modern embedding models derived from GPT models often combine both: if you feed in single words, you get word embeddings; feeding in sentences gives you the advantages of sentence embeddings. That's nice!
Retrieval-augmented generation: That's likely one of the big use cases for embeddings. In a nutshell: large language models still have only a very limited context window. GPT-4 Turbo is said to offer a 128k-token context window - which is roughly 300-400 pages of text. Impressive! Still we want more: for one, it's not much compared to the information out there, and then: commercial models charge per token, so why would we want to search for the needle in the haystack if we have to pay for the haystack and the needle? It would be better to say 'find the needle' in terabytes of data without the data even going to the cloud! Cheaper, faster, less energy consuming.
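Here is a toy version of that idea, with a plain NumPy array standing in for a real vector store; the model choice, the chunks, and the query are illustrative assumptions, not recommendations:

```python
# Toy retrieval: embed chunks once, embed the query, and send only the
# best matches to the LLM instead of the whole haystack.
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("intfloat/multilingual-e5-small")

chunks = [
    "Invoices are archived for ten years.",
    "The cafeteria opens at 11:30.",
    "Purchase orders above 10,000 EUR need a second approval.",
]
# The e5 model family expects 'passage: ' / 'query: ' prefixes.
chunk_vecs = model.encode([f"passage: {c}" for c in chunks], normalize_embeddings=True)
query_vec = model.encode("query: How long do we keep invoices?", normalize_embeddings=True)

# Cosine similarity reduces to a dot product on normalized vectors.
scores = chunk_vecs @ query_vec
best = np.argsort(scores)[::-1][:2]
context = "\n".join(chunks[i] for i in best)  # only this goes into the prompt
print(context)
```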
In production, that's where vector stores enter the stage (e.g. the SAP HANA Vector Engine) to hold a complex vector representation of what was once your text.
The vector representation of what was once text is generated by the embedding model you picked for the task. For example, a model might place an apple close to a pear, while a tomato would be further away. Now imagine that with other dimensions that establish context or represent language. An apple in Japanese would likely not be far away from an 'English' apple - something that might be very much wanted to allow for multilingual searches of meanings in international texts. Here's a nice playground to understand that idea better.
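And here is a small self-contained experiment along those lines; the model is again the multilingual one from my tests further below, and the exact scores will of course vary from model to model:

```python
# Apple vs. pear vs. tomato, plus the Japanese apple (りんご):
# a multilingual model should place りんご near the English apple.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("intfloat/multilingual-e5-small")

words = ["apple", "pear", "tomato", "りんご"]
vecs = model.encode(words, normalize_embeddings=True)

sims = util.cos_sim(vecs, vecs)  # pairwise cosine similarities
for i in range(1, len(words)):
    print(f"apple vs {words[i]}: {sims[0][i]:.3f}")
```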
'My company uses GPT models from (Azure) OpenAI, so I'll just use their embedding model, done!' - Ah yes, you can do that, but one reason I wrote this blog is that I think it's not that simple. Consider these aspects:
1. Model performance: Here the embeddings from OpenAI certainly have a point - they perform quite well and deliver acceptable results on average. But they are not the best. Say you need great performance for clustering, pairing or classification of text - there is no 'one for all' model. Or you want great performance with Chinese text only, and so on. So model performance - delivering not just an embedding but one that lets you find what you search for later - is key.
2. Data security: We all work with cloud solutions - and we all do it by weighing the benefits of processing against its risks. Sending your business text data to an entity outside of your defined cloud space just to retrieve embeddings might be an unnecessary risk.
3. Cost: Commercial models almost always charge for creating embeddings - not much, but if you embed giga- and terabytes of data it adds up.
4. Latency: This might be an important point to consider: you move large text data out to a cloud service, you move even larger data back in (if your vector engine sits with your cloud or on-premise entity), and you sometimes experience massive network latency depending on the size of chunks the service can digest.
I suggest considering the above four points when going for a generative AI project that will need embeddings to represent your business data. If any of these four points is a concern, it's already worth looking into embedding models beyond the cloud services.
Say you want to convert your business text data into embeddings in an extension you create on one of SAP BTP's runtimes: Cloud Foundry or Kyma. Then - for the moment - the throughput from text to embedding per time unit on CPU is crucial. You also need to understand what you want to derive from the text later, as mentioned above (e.g. retrieval, clustering or other tasks). Are your texts in one language only? Good! There are high-performing models that can process texts with great speed even on CPUs.
Many great-performing models demand a tribute: computing power. One approach you can take is deploying the model of your choice on SAP AI Core to leverage GPU acceleration. For that, using SentenceTransformers is an approach I found very handy: deploy your own service as a web service that you can then consume in your extension application.
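As a rough sketch of what such a service could look like - FastAPI, the route name and the model are my illustrative choices here, not something SAP AI Core prescribes:

```python
# Minimal embedding web service, e.g. to containerize and deploy on
# SAP AI Core. Framework, route and model are illustrative assumptions.
from fastapi import FastAPI
from pydantic import BaseModel
from sentence_transformers import SentenceTransformer

app = FastAPI()
model = SentenceTransformer("intfloat/multilingual-e5-small")  # load once at startup

class EmbedRequest(BaseModel):
    texts: list[str]

@app.post("/embed")
def embed(req: EmbedRequest):
    # Batch-encode; this is where a GPU deployment pays off.
    vectors = model.encode(req.texts, normalize_embeddings=True)
    return {"embeddings": vectors.tolist()}

# Local test: uvicorn app:app --port 8080
```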
I was looking for a model with good overall performance and the ability to work with multiple languages, while having only CPUs available to compute the embeddings. The table below contains my findings.
| Model | Time to embed (s) | Japanese/English retrieval | Multilanguage practical test |
|---|---|---|---|
| all-MiniLM-L6-v2 | 1.473 | 0.154 | not good |
| all-MiniLM-L12-v2 | 1.5 | 0.217 | not good |
| intfloat/multilingual-e5-small | 2.724 | 0.931 | ok |
| avsolatorio/GIST-Embedding-v0 | 7.813 | 0.639 | not measured |
| intfloat/multilingual-e5-base | 16.037 | 0.929 | ok |
| thenlper/gte-large | 19.825 | not measured | not good |
| BAAI/bge-m3 | 21.767 | 0.879 | ok |
| llmrails/ember-v1 | 29.61 | not measured | not good |
| WhereIsAI/UAE-Large-V1 | 31.877 | not measured | not measured |
| intfloat/multilingual-e5-large | 48.056 | not measured | not measured |
| intfloat/e5-large | not measured | 0.739 | not measured |
| intfloat/e5-base | not measured | 0.724 | not measured |
Time to embed is how long it took, on my own notebook using CPU only, to embed a Japanese document of 56 A4 pages.
Japanese/English retrieval is how well the model matches two sentences that have the same meaning, one in Japanese and one in English (higher is better). That is an important characteristic if you want to run, say, an English search on a Japanese text or vice versa.
The last column, finally, is my very subjective judgement considering all three criteria for my use case. I then decided to use intfloat/multilingual-e5-small for my project.
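For transparency, the two measured columns boil down to something like the sketch below; the document and the sentence pair are stand-ins for my actual test data, and the naive chunking is only there because embedding models have input-length limits:

```python
# Reproducing the two measured columns: wall-clock embedding time for a
# long document, and cosine similarity of a translated sentence pair.
import time
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("intfloat/multilingual-e5-small")

# 'Time to embed': a long Japanese document (faked here by repetition),
# embedded chunk by chunk because of the model's input-length limit.
long_doc = "これはテスト用の文章です。" * 2000
chunks = [long_doc[i:i + 400] for i in range(0, len(long_doc), 400)]
start = time.perf_counter()
model.encode(chunks)
print(f"time to embed: {time.perf_counter() - start:.3f} s")

# 'Japanese/English retrieval': similarity of a translated sentence pair.
en = model.encode("The invoice must be archived for ten years.", normalize_embeddings=True)
ja = model.encode("請求書は10年間保管しなければなりません。", normalize_embeddings=True)
print(f"JA/EN similarity: {util.cos_sim(en, ja).item():.3f}")
```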
As we wrap up this discussion on embeddings, it's clear that their role in Generative AI is both fundamental and multifaceted. This blog has aimed to demystify the concept of embeddings, providing a practical viewpoint on their selection and application.
By considering the key factors of performance, data privacy, cost, and latency, we've navigated the complexities that come with choosing the right embedding for your specific needs. The shared experiences and findings serve as a guide to help you make informed decisions, ensuring that your AI projects are not only effective but also aligned with your operational constraints and goals. Make the most out of embeddings to enhance your business data's representation!