Authors: @YatseaLi, @cesarecalabria, @amagnani, @jacobahtan
In the previous blog post, Bring Open-Source or Open-Weight LLMs into SAP AI Core, we gave an overview of deploying and running open-source LLMs in SAP AI Core with the BYOM approach: the use cases of open-source LLMs, the sample application byom-oss-llm-ai-core and its solution architecture, and the various options for serving open-source LLMs within SAP AI Core through open-source LLM inference servers such as Ollama, LocalAI, llama.cpp and vLLM. (In the rest of this article, I will use "Open-Source LLMs" to refer to both Open-Source and Open-Weight LLMs for simplicity.)
Here is the blog post series of Bring Open-Source LLMs into SAP AI Core:
Part 1 – Bring Open-Source LLMs into SAP AI Core: Overview
Part 2 – Bring Open-Source LLMs into SAP AI Core with Ollama
Part 3 – Bring Open-Source LLMs into SAP AI Core with Custom Transformer Server
Part 4 – Bring Open-Source Text Embedding Models into SAP AI Core with Infinity (this blog post)
Part 5 – Bring Open-Source LLMs into SAP AI Core with LocalAI (to be published)
Part 6 – Bring Open-Source LLMs into SAP AI Core with llama.cpp (to be published)
Part 7 – Bring Open-Source LLMs into SAP AI Core with vLLM (to be published)
Note: You can try out the sample AI Core app byom-oss-llm-ai-core by following its manual here, with all the technical details. The follow-up blog posts wrap up the technical details of each option.
In this blog post, we'll take an end-to-end technical deep dive into bringing open-source text embedding models into SAP AI Core through Infinity.
Infinity is a high-performance REST API for serving vector embeddings generated by various text-embedding models. It's open-source under the MIT license, making it freely available for integration into SAP AI Core.
Key features of Infinity include support for a wide range of text embedding models from Hugging Face, dynamic batching for high-throughput inference, multiple inference backends (such as torch and ONNX/optimum), and an OpenAI-compatible REST API.
Here is a short demo of Infinity.
With Infinity, running it locally on your own computer is straightforward, in three steps: Install > Start > Inference.
Infinity is freely available and can be installed via pip:
pip install "infinity-emb[all]"
After your pip install, with your venv active, you can run the CLI directly:
infinity_emb --url-prefix "/v1" --model-name-or-path "nreimers/MiniLM-L6-H384-uncased" --port "7998"
If you would like to understand the various options and parameters:
infinity_emb --help
For our case, we used one of the models from the Massive Text Embedding Benchmark (MTEB), nreimers/MiniLM-L6-H384-uncased.
Once it is up and running, we can run inference against the text embedding model. First, let's check that the model is served correctly.
curl -X 'GET' \
  'http://0.0.0.0:7998/v1/models' \
  -H 'accept: application/json'
Response:
{
"data":[
{
"id":"nreimers/MiniLM-L6-H384-uncased",
"stats":{
"queue_fraction":0.0,
"queue_absolute":0,
"results_pending":0,
"batch_size":32
},
"object":"model",
"owned_by":"infinity",
"created":1719902031,
"backend":"torch"
}
],
"object":"list"
}
Next, let's run inference against the text embedding model to generate vector embeddings for the following sample text: "a sentence to encode"
curl -X 'POST' \
'http://0.0.0.0:7998/v1/embeddings' \
-H 'accept: application/json' \
-H 'Content-Type: application/json' \
-d '{
"input": [
"a sentence to encode."
]
}'
Response:
{
"object":"embedding",
"data":[
{
"object":"embedding",
"embedding":[
-0.07419787347316742,
0.05378536880016327,
"...",
"...",
0.008502810262143612,
0.010130912065505981
],
"index":0
}
],
"model":"nreimers/MiniLM-L6-H384-uncased",
"usage":{
"prompt_tokens":21,
"total_tokens":21
},
"id":"infinity-c522de74-90d2-415d-9455-d3ea480785a0",
"created":1719903053
}
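As a quick sanity check of what these vectors are good for, here is a minimal sketch (assuming the local Infinity server above is still running on port 7998) that compares two sentences by the cosine similarity of their embeddings:

import requests
import numpy as np

# Request embeddings for two sentences from the local Infinity server
resp = requests.post(
    "http://0.0.0.0:7998/v1/embeddings",
    json={"input": ["a sentence to encode.", "a phrase to embed."]},
)
v1, v2 = (np.array(item["embedding"]) for item in resp.json()["data"])

# Cosine similarity: dot product of the L2-normalized vectors
similarity = float(np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2)))
print(f"cosine similarity: {similarity:.4f}")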
That looks promising. You may wonder if you can deploy and run Infinity within SAP AI Core for serving open-source Text Embedding Models. The answer is YES.
In this section, we will go through the technical details of bringing Infinity into SAP AI Core through the BYOM (Bring Your Own Model) approach. All the sample code shown in this section can be found here.
To bring a custom inference server into SAP AI Core, a few common requirements must be met:
- The Docker image of the inference server must be available in a container registry accessible from SAP AI Core (e.g., Docker Hub).
- A serving template for the inference server must be hosted in a GitHub repository onboarded to SAP AI Core.
- The inference endpoints must be exposed under the /v1 path.
- The container must be able to run as a non-root user with restricted write access to the file system.
We won't go through all the steps in this blog post. For the prerequisites and initial configuration (onboarding the GitHub repository and creating an application in SAP AI Core), please refer to the prerequisites section of the sample app byom-oss-llm-ai-core.
Instead, we'll take Infinity as a sample custom inference server to deploy and run in SAP AI Core. It is already a ready-to-use inference server for open-source text embedding models, so no extra inference server code is needed. That said, apart from bringing your own inference code, you can also bring a ready-to-use inference server to SAP AI Core, as long as it is compliant with the requirements mentioned above.
Here is the conceptual components diagram of Deploying and Running Infinity within SAP AI Core.
At design time, it only needs two files: the Dockerfile (with its run.sh entry script) for the Infinity server image, and the serving template infinity-template.yaml.
At runtime, we'll deploy and run an Infinity server in SAP AI Core based on its serving template.
Important Note: The steps described below mainly aim to explain the process; they are automated in the Jupyter notebooks 00-init-config.ipynb and /infinity-emb/01-deployment.ipynb, which you can run through (please pay attention to their prerequisites). Alternatively, you can perform steps 4 and 5 manually through SAP AI Launchpad.
I have prepared a Dockerfile of Infinity adapted for SAP AI Core; let's walk through it.
Dockerfile:
FROM pytorch/pytorch:2.3.0-cuda12.1-cudnn8-runtime AS runtime
WORKDIR /usr/src
# Update and install dependencies
RUN apt-get update && \
apt-get install -y \
ca-certificates \
nginx \
curl && \
apt-get clean && \
rm -rf /var/lib/apt/lists/*
RUN python3 -m pip install --upgrade pip==23.2.1 && \
python3 -m pip install "infinity-emb[all]" && \
rm -rf /root/.cache/pip
EXPOSE 7997
# Adaptation for SAP AI Core
COPY run.sh /usr/src/run.sh
RUN mkdir -p /nonexistent/ && \
mkdir -p /hf-home/ && \
chown -R nobody:nogroup /nonexistent /hf-home/ && \
chmod -R 770 /nonexistent/ /hf-home/ && \
chmod +x /usr/src/run.sh
ENV HF_HOME=/hf-home
# Note: Uncomment this ENV block with MODEL_NAME & URL_PREFIX if you're running the Docker image locally. Don't forget the trailing backslash \
# ENV MODEL_NAME="nreimers/MiniLM-L6-H384-uncased" \
#     URL_PREFIX="/v1"
ENTRYPOINT [ "/usr/src/run.sh" ]
run.sh:
#!/bin/bash
# Set default Host to 0.0.0.0 if not already set
HOST="${HOST:-0.0.0.0}"
OPT+=" --host ${HOST}"
# Add port to options if PORT is set and --port is not already in ARG
if [ ! -z "${PORT}" ] && [[ ! "${ARG}" =~ --port ]]; then
OPT+=" --port ${PORT}"
fi
# Echo the model name and URL prefix to be used
echo ${MODEL_NAME}
echo ${URL_PREFIX}
# Use set -x to print commands and their arguments as they are executed.
set -x
# Run the service with the model, URL prefix and the prepared options
infinity_emb --url-prefix "${URL_PREFIX}" --model-name-or-path "${MODEL_NAME}" ${OPT}
The code is largely self-explanatory. Here are the important adaptations of Infinity for SAP AI Core:
- SAP AI Core runs the container as a non-root user, so the Dockerfile pre-creates the /nonexistent and /hf-home directories with permissions for the nobody user.
- HF_HOME is set to /hf-home so that the Hugging Face model cache lands in that writable location.
- The run.sh entry script reads MODEL_NAME, URL_PREFIX, HOST and PORT from environment variables, which the serving template injects at deployment time; URL_PREFIX is set to /v1 as required by SAP AI Core.
This step has been automated with /infinity-emb/01-deployment.ipynb for the sample byom-oss-llm-ai-core. Once the Dockerfile is in place, we can build the Docker image and push it to Docker Hub with the commands below:
# 0.Login to docker hub
docker login -u <YOUR_DOCKER_USER> -p <YOUR_DOCKER_ACCESS_TOKEN>
# 1.Build the docker image
docker build --platform=linux/amd64 -t docker.io/<YOUR_DOCKER_USER>/infinity:ai-core .
# 2.Push the docker image to docker hub to be used by deployment in SAP AI Core
docker push docker.io/<YOUR_DOCKER_USER>/infinity:ai-core
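Optionally, you can smoke-test the image locally before relying on it in SAP AI Core. A minimal sketch, assuming Docker runs on your machine and port 7997 is free (the environment variables mirror what the serving template will inject later):

# 3.(Optional) Run the image locally and exercise the /v1/models endpoint
docker run --rm --platform=linux/amd64 -p 7997:7997 \
    -e MODEL_NAME="nreimers/MiniLM-L6-H384-uncased" \
    -e URL_PREFIX="/v1" \
    -e PORT="7997" \
    docker.io/<YOUR_DOCKER_USER>/infinity:ai-core
# In a second terminal, the model list should come back as shown earlier
curl http://localhost:7997/v1/models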
I have prepared a sample serving template for Infinity (infinity-template.yaml). Let's walk through it.
apiVersion: ai.sap.com/v1alpha1
kind: ServingTemplate
metadata:
  name: infinity
  annotations:
    scenarios.ai.sap.com/description: "Run an Infinity embedding inference server on SAP AI Core"
    scenarios.ai.sap.com/name: "infinity"
    executables.ai.sap.com/description: "Run an Infinity embedding inference server on SAP AI Core"
    executables.ai.sap.com/name: "infinity"
  labels:
    scenarios.ai.sap.com/id: "infinity"
    ai.sap.com/version: "0.0.1"
spec:
  inputs:
    parameters:
      - name: image
        type: "string"
        default: "docker.io/<YOUR_DOCKER_USER>/infinity:ai-core"
        description: "Location of the Docker image you built for Infinity following the steps above."
      - name: modelName
        type: "string"
        default: "nreimers/MiniLM-L6-H384-uncased"
        description: "The sentence-transformer (MTEB) model you would like Infinity to serve. More info: https://michaelfeil.eu/infinity/latest/"
      - name: urlPrefix
        type: "string"
        default: "/v1"
        description: "SAP AI Core requires the root of the inference server to start with /v1."
      - name: portNumber
        type: "string"
        default: "7997"
        description: "The port number through which the application inside the container is accessed."
      - name: resourcePlan
        type: "string"
        default: "infer.s"
        description: "Resource plan used to select resources in workflow and serving templates."
      - name: minReplicas
        type: "string"
        default: "1"
        description: "The lower limit for the number of replicas to which the autoscaler can scale down."
      - name: maxReplicas
        type: "string"
        default: "1"
        description: "The upper limit for the number of replicas to which the autoscaler can scale up."
  template:
    apiVersion: "serving.kserve.io/v1beta1"
    metadata:
      annotations: |
        autoscaling.knative.dev/metric: concurrency
        autoscaling.knative.dev/target: 1
        autoscaling.knative.dev/targetBurstCapacity: -1
        autoscaling.knative.dev/window: "10m"
        autoscaling.knative.dev/scaleToZeroPodRetentionPeriod: "10m"
      labels: |
        ai.sap.com/resourcePlan: "{{inputs.parameters.resourcePlan}}"
    spec: |
      predictor:
        imagePullSecrets:
          - name: <YOUR_DOCKER_SECRET>
        minReplicas: {{inputs.parameters.minReplicas}}
        maxReplicas: {{inputs.parameters.maxReplicas}}
        containers:
          - name: kserve-container
            image: "{{inputs.parameters.image}}"
            ports:
              - containerPort: {{inputs.parameters.portNumber}}
                protocol: TCP
            env:
              - name: MODEL_NAME
                value: "{{inputs.parameters.modelName}}"
              - name: URL_PREFIX
                value: "{{inputs.parameters.urlPrefix}}"
There are 7 input parameters, which are fairly self-explanatory; for more information, refer to their description values.
As mentioned in Step 3, the serving template needs to be hosted in a GitHub repository for SAP AI Core. This step has been automated for the sample byom-oss-llm-ai-core in 00-init-config.ipynb. Alternatively, you can onboard a GitHub repository manually through SAP AI Launchpad. As a result, the associated GitHub repository is onboarded into SAP AI Core.
This step has been automated for the sample byom-oss-llm-ai-core in 00-init-config.ipynb. Alternatively, you can create your own application and sync it with the associated GitHub repository onboarded in step 4 manually through SAP AI Launchpad. As a result, an application has been created, and a scenario for infinity is created after synchronization.
If we have a look at the infinity scenario, it has 7 input parameters as defined in its serving template (infinity-template.yaml).
This step has been automated with 01-deployment.ipynb for the sample byom-oss-llm-ai-core. Alternatively, you can create a configuration and start a deployment with SAP AI Launchpad. Here is the demo recording:
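Under the hood, the notebook essentially performs the following AI API calls. A minimal sketch, assuming ai_api_client is an authenticated AIAPIV2Client scoped to your resource group (as prepared in 00-init-config.ipynb); the parameter values mirror the serving template defaults:

from ai_api_client_sdk.models.parameter_binding import ParameterBinding

# Create a configuration that binds the serving template's input parameters
config = ai_api_client.configuration.create(
    name="infinity-config",
    scenario_id="infinity",
    executable_id="infinity",
    parameter_bindings=[
        ParameterBinding(key="image", value="docker.io/<YOUR_DOCKER_USER>/infinity:ai-core"),
        ParameterBinding(key="modelName", value="nreimers/MiniLM-L6-H384-uncased"),
        ParameterBinding(key="urlPrefix", value="/v1"),
        ParameterBinding(key="portNumber", value="7997"),
        ParameterBinding(key="resourcePlan", value="infer.s"),
        ParameterBinding(key="minReplicas", value="1"),
        ParameterBinding(key="maxReplicas", value="1"),
    ],
)

# Start a deployment from that configuration; SAP AI Core pulls the image
# from Docker Hub and launches the Infinity server
deployment = ai_api_client.deployment.create(configuration_id=config.id)
print(deployment.id, deployment.status)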
In our sample btp-generative-ai-hub-use-cases/01-social-media-citizen-reporting-genai-hub, the use case is about a fictitious city called "Sagenai City" facing challenges in managing and tracking maintenance in public areas. The city wants to improve the way it handles issues reported by citizens: analyzing social media posts with an LLM, whose output is structured as a JSON schema through plain prompting, to make informed decisions and so effectively track and manage issues in public spaces.
Previously in this use case, to deduplicate citizen-reported issues, we used one of the generative AI text embedding models available through SAP AI Core from OpenAI, text-embedding-ada-002, to generate vector embeddings and store them in the SAP HANA Cloud Vector Engine. You can read more about it here.
You can run inference against the model in Infinity with plain HTTP calls, which works from any programming language that supports HTTP calls to a remote server, such as JavaScript (CAP), Java (CAP), ABAP, Python etc. Here is a code snippet in Python for illustration. Please check out the full sample Jupyter notebook /infinity-emb/02-embedding.ipynb.
import json
import requests

# Retrieve the deployment URL of the running Infinity server in SAP AI Core
deployment = ai_api_client.deployment.get(deployment_id)
inference_base_url = f"{deployment.deployment_url}"
endpoint = f"{inference_base_url}/v1/embeddings"

# Authenticate with an SAP AI Core token and resource group (headers as
# prepared in the sample notebook; exact token retrieval may differ)
headers = {
    "Authorization": ai_api_client.rest_client.get_token(),
    "ai-resource-group": resource_group,
    "Content-Type": "application/json",
}
json_data = {
"input": [
"A sentence to encode."
]
}
response = requests.post(endpoint, headers=headers, json=json_data)
x = json.loads(response.content)
print(x['data'][0]['embedding'])
Results:
[-0.0741010531783104, 0.05380121245980263,...,..., -0.050218887627124786, -0.024147523567080498]
You can also run inference against the model in Infinity with the SAP Generative AI Hub SDK, which simplifies access to SAP Generative AI Hub for application development and integration. Please check the home page of the Python package for details. As of 20 Jun 2024, the SDK is only available as a Python package, hence it is only available for Python application development. Here is a code snippet for illustration; please check out the full Jupyter notebook infinity-emb/02-embedding-sap-genai-hub-sdk.ipynb for more detail.
The high-level flow is as follows: install the SDK, register the custom BYOM scenario with the proxy client, and then call the OpenAI-compatible embeddings API through it.
pip install generative-ai-hub-sdk[langchain]
from gen_ai_hub.proxy.gen_ai_hub_proxy import GenAIHubProxyClient

# Register our BYOM scenario so the SDK can route matching configurations
# to the Infinity deployment's /v1/embeddings endpoint
GenAIHubProxyClient.add_foundation_model_scenario(
    scenario_id="byom-infinity-server",
    config_names="infinity*",
    prediction_url_suffix="/v1/embeddings",
)
proxy_client = GenAIHubProxyClient(ai_core_client=ai_core_client)
from gen_ai_hub.proxy.native.openai import embeddings
response = embeddings.create(
input="Every decoding is another encoding.",
model_name="nreimers/MiniLM-L6-H384-uncased",
encoding_format='base64'
)
print(response.data[0].embedding)
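With encoding_format='base64', the embedding may come back base64-encoded rather than as a plain list of floats. A small sketch of decoding it with numpy, assuming the server returns float32 values as OpenAI-compatible servers typically do:

import base64
import numpy as np

raw = response.data[0].embedding
if isinstance(raw, str):
    # Decode the base64 payload into a float32 vector
    vector = np.frombuffer(base64.b64decode(raw), dtype=np.float32)
else:
    vector = np.asarray(raw)
print(vector.shape)  # expected: (384,) for nreimers/MiniLM-L6-H384-uncased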
In the following code snippets, you will find some examples of how to integrate with the SAP HANA Cloud Vector Engine.
# 1. Connect to SAP HANA Cloud Vector Engine with hana_ml library
from hana_ml import ConnectionContext
cc= ConnectionContext(
address='<HANA_CLOUD_IP>.hna0.prod-eu10.hanacloud.ondemand.com',
port='<PORT>',
user='<USER>',
password='<PASSWORD>',
encrypt=True
)
# 2. Define the Embedding method - using open source embedding model
from gen_ai_hub.proxy.native.openai import embeddings
def get_embedding(input, model="nreimers/MiniLM-L6-H384-uncased") -> list:
    response = embeddings.create(
        model_name=model,
        input=input
    )
    return response.data[0].embedding
# 3. Create a table
cursor = cc.connection.cursor()
sql_command = '''CREATE TABLE SOCIAL_CITIZEN_GENAI_PROCESSEDISSUES(ID INTEGER, PROCESSOR NVARCHAR(5000), PROCESSDATE NVARCHAR(5000), PROCESSTIME NVARCHAR(5000), REPORTEDBY NVARCHAR(5000), DECISION NVARCHAR(5000), REDDITPOSTID NVARCHAR(5000), MAINTENANCENOTIFICATIONID NVARCHAR(5000), ADDRESS NVARCHAR(5000), LOCATION NVARCHAR(5000), LAT NVARCHAR(5000), LONG NVARCHAR(5000), GENAISUMMARY NVARCHAR(5000), GENAIDESCRIPTION NVARCHAR(5000), PRIORITY NVARCHAR(5000), PRIORITYDESC NVARCHAR(5000), SENTIMENT NVARCHAR(5000), CATEGORY NVARCHAR(5000), DATE DATE, TIME NVARCHAR(5000), TEXT NVARCHAR(5000), EMBEDDING NCLOB);'''
cursor.execute(sql_command)
cursor.close()
# 4. Add REAL_VECTOR column (Take note here that we define the Embedding Dimensions of 384, which matches with the Text Embedding model)
cursor = cc.connection.cursor()
sql_command = '''ALTER TABLE SOCIAL_CITIZEN_GENAI_PROCESSEDISSUES ADD (VECTOR REAL_VECTOR(384));'''
cursor.execute(sql_command)
cursor.close()
# 5. Generate embeddings from the text
import math
rows = []
for index, row in df.iterrows():
data_to_insert = row.to_dict()
text=row.TEXT
x=row.MAINTENANCENOTIFICATIONID
# check on maintenance notification id as some values are NaN
if math.isnan(x):
maintenanceNotID=0
else:
maintenanceNotID=row.MAINTENANCENOTIFICATIONID
text_vector = get_embedding(input=text)
myrow = (row['ID'], row['PROCESSOR'], row['PROCESSDATE'], row['PROCESSTIME'], row['REPORTEDBY'], row['DECISION'],
row['REDDITPOSTID'], maintenanceNotID, row['ADDRESS'], row['LOCATION'], row['LAT'],
row['LONG'], row['GENAISUMMARY'], row['GENAIDESCRIPTION'], row['PRIORITY'], row['PRIORITYDESC'], row['SENTIMENT'],
row['CATEGORY'], row['DATE'], row['TIME'], row['TEXT'], str(text_vector), str(text_vector))
rows.append(myrow)
# Bulk insert of 23 fields parameterised in rows variable
cc.connection.setautocommit(False)
cursor = cc.connection.cursor()
sql = '''INSERT INTO SOCIAL_CITIZEN_GENAI_PROCESSEDISSUES
VALUES(?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,TO_REAL_VECTOR(?));'''
try:
cursor.executemany(sql, rows)
except Exception as e:
cc.connection.rollback()
print("An error occurred:", e)
try:
cc.connection.commit()
finally:
cursor.close()
cc.connection.setautocommit(True)
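Before querying, it is worth verifying the load. A quick sanity check with hana_ml, reusing the same ConnectionContext cc as above:

# Count the inserted rows and peek at a few of them
print(cc.sql('SELECT COUNT(*) AS CNT FROM SOCIAL_CITIZEN_GENAI_PROCESSEDISSUES').collect())
print(cc.sql('SELECT TOP 3 ID, CATEGORY, TEXT FROM SOCIAL_CITIZEN_GENAI_PROCESSEDISSUES').collect())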
# 1. Define a run vector search method
def run_vector_search(query: str, metric="COSINE_SIMILARITY", k=4):
if metric == 'L2DISTANCE':
sort = 'ASC'
else:
sort = 'DESC'
query_vector = get_embedding(input=query)
sql = '''SELECT TOP {k} "ID", "CATEGORY", "TEXT", "DATE", "LOCATION", "{metric}"("VECTOR", TO_REAL_VECTOR('{qv}')) AS SIM
FROM "DBADMIN"."SOCIAL_CITIZEN_GENAI_PROCESSEDISSUES"
ORDER BY "SIM" {sort}'''.format(k=k, metric=metric, qv=query_vector, sort=sort)
hdf = cc.sql(sql)
df_context = hdf.head(k).collect()
return df_context
# 2. Prepare a text string sample
vector_str = """📢 Urgent Report for Public Attention 📢
Dear neighbours of Sagenai,
I hope this post finds you well. I am writing today to bring to your attention a pressing issue that requires immediate action from our local authorities. 🚮
In the heart of our beautiful neighbourhood, specifically at 27-3 Victoria Rd, London, UK, we are currently facing a problem that greatly affects our daily lives: an overflowing dustbin. 🗑️ The pungent odor, unsightly sight, and the potential for vermin and health hazards pose a significant inconvenience for all of us. 🤢
I kindly request our esteemed local administration to address this matter promptly, ensuring the cleanliness and hygiene we deserve in our shared spaces. 🙏🏼 Let's work together to maintain the charm and cleanliness of our beloved Sagenai!
Thank you for your attention and support in resolving this matter.
Best regards,
Concerned Citizen 🌟
Coordinates:(51.553842239632296,0.0041263312666776075)
https://preview.redd.it/8f6f8tumw"""
# 3. Perform Similarity Search
df = run_vector_search(query = vector_str, k = 10)
df
Please note that the following bonuses are part of the blog post series showing you how to explore the generative AI capabilities of SAP AI Foundation, along with proofs of concept in the form of use cases:
Bring Open-Source LLMs into SAP AI Core with Infinity:
Please refer to this manual to try out deploying and running open-source text embedding models with Infinity in SAP AI Core. The source code of this sample is released under the Apache 2.0 license. You are accountable for your own choice of commercially viable open-source LLMs/LMMs/text embedding models.
This blog post has explored the exciting potential of integrating open-source Text Embedding Models with SAP AI Core's Bring Your Own Model (BYOM) approach. We've delved into the technical details of deploying and running Infinity, a high-performance text embedding inference server, within your SAP AI Core environment.
We have been witnessing a fundamental transformation of industries and functions by generative AI, in areas such as education, media, software development, marketing, and customer service. The open-source LLM community is evolving rapidly, and it has a role to play where data protection and privacy are paramount.
For SAP developers who need to leverage generative AI in their solutions, SAP provides the Generative AI Hub as easy access to a wide range of leading LLMs, both proprietary and open-source, which should cover most of your use cases.
For particular business cases where you need an open-source LLM that is not yet available in Generative AI Hub, or where you have fine-tuned an open-source model with your own data, SAP AI Core can be used to deploy and run it with a custom inference server or a ready-to-use open-source inference server such as Infinity, in a manner of your choice and under your own responsibility.
Disclaimer: SAP notes that posts about potential uses of generative AI and large language models are merely the individual poster's ideas and opinions, and do not represent SAP's official position or future development roadmap. SAP has no legal obligation or other commitment to pursue any course of business, or develop or release any functionality, mentioned in any post or related content on this website.