NOTE: The views and opinions expressed in this blog are my own.
In several earlier blogs I demonstrated how a simple demo/PoC chatbot consuming OpenAI APIs and Hugging Face APIs could be built in Python and deployed on BTP with fewer than 100 lines of code:
Simplify your LLM Chatbot Demos with SAP BTP and Gradio
LLAMA2: Testing a Tiny LLM with Hugging Face & SAP BTP
While interesting, those examples left the question of enterprise security open.
In this blog I will extend the simple app further to consume LLMs provided by SAP AI Core and SAP's new generative AI functionality, which offer a more secure, enterprise-grade option.
If you follow the earlier blogs and add the additional logic provided here, the simple app will have access to 4 different LLMs, as illustrated below:
The new LLMs covered in this blog require entitlements for, and access to, SAP AI Core with the extended plan.
The 2 LLMs that should be deployed as prerequisites for this blog are:
- an Ollama server running Phi-2 (scenario ollama-server)
- Azure OpenAI gpt-35-turbo, exposed through SAP AI Core's generative AI foundation models (scenario foundation-models)
For a great overview of how to install Ollama and Phi-2 on AI Core, see:
It's Christmas! Ollama+Phi-2 on SAP AI Core
Once you have deployed these, you should see two deployments running in SAP AI Launchpad:
Assuming we are running the Python app in the same BTP account that is running AI Core, we can simply bind AI Core to the app with small amendments to the manifest.yaml (a quick runtime check of the binding is sketched after the snippet):
services:
- openai_service
- hf_llama2_inf_service
#NEW
- ai-core-extended
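All going well, after the next cf push the binding can be verified from inside the app. The following minimal sketch is an assumption on my part (it is not part of the final server.py); it simply reads the bound credentials via cfenv, using the same field names that server.py relies on later:
import cfenv

env = cfenv.AppEnv()

# Read the bound SAP AI Core (extended plan) instance from VCAP_SERVICES
ai_core = env.get_service(name="ai-core-extended")
if ai_core:
    creds = ai_core.credentials
    print("AI API URL:", creds['serviceurls']['AI_API_URL'])
    print("Auth URL:  ", creds['url'] + "/oauth/token")
    print("Client id present:", 'clientid' in creds)
else:
    print("Service 'ai-core-extended' is not bound.")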
To simplify the calls from the Python app to AI Core, we need to add the SAP AI API client SDK library to the requirements (a quick local sanity check with this SDK is sketched after the list):
gradio==4.14.0
cfenv
openai
huggingface_hub
#NEW
ai-api-client-sdk
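If you prefer to sanity-check the AI API before touching the app, the same SDK can also be used locally with a downloaded service key. The sketch below is only for illustration - the file name aicore-key.json is hypothetical, but the field names and calls mirror what server.py does once the service is bound:
import json
from ai_api_client_sdk.ai_api_v2_client import AIAPIV2Client

# Hypothetical service key file downloaded for the ai-core-extended instance
with open("aicore-key.json") as f:
    key = json.load(f)

ai_api_client = AIAPIV2Client(
    base_url=key['serviceurls']['AI_API_URL'] + "/v2",
    auth_url=key['url'] + "/oauth/token",
    client_id=key['clientid'],
    client_secret=key['clientsecret'],
    resource_group='default'
)

# List the running deployments for the two scenarios used in this blog
for scenario_id in ['ollama-server', 'foundation-models']:
    deployments = ai_api_client.rest_client.get(
        path="/lm/deployments",
        headers={"AI-Resource-Group": "default"},
        params={"scenarioId": scenario_id}
    )
    for deployment in deployments['resources']:
        print(scenario_id, deployment['deployment_url'])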
In server.py, new logic is required to access SAP AI Core, determine the deployment path information, and add custom chat handling logic for the two new LLM APIs.
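To illustrate the path handling: the AI API returns an absolute deployment URL, and the regular expression keeps only the path after /v2, which is what rest_client expects as a relative path. The URL below is purely illustrative, not a real deployment:
import re

pattern = re.compile(r'/v2(.*)')

# Illustrative deployment URL shape - the real value comes from /lm/deployments
deployment_url = "https://api.ai.example.ondemand.com/v2/inference/deployments/d123456789"
matches = pattern.findall(deployment_url)
print(matches[0])  # -> /inference/deployments/d123456789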
The updated server.py is:
import os, cfenv
import gradio as gr
import json, time, random
import re
#SAP AI CORE
from ai_api_client_sdk.ai_api_v2_client import AIAPIV2Client
from openai import OpenAI
from typing import Iterator #Stream results
from huggingface_hub import InferenceClient
port = int(os.environ.get('PORT', 7860))
print("PORT:", port)
# Define the regular expression pattern to extract the inference path
pattern = re.compile(r'/v2(.*)')
# Create a Cloud Foundry environment object
env = cfenv.AppEnv()
llm_models = [ ]
# Check if the app is running in Cloud Foundry
if env.app:
    try:
        # Get the OpenAI credentials
        service_name = "openai_service"
        openai_api_key = env.get_service(name=service_name).credentials['api_key']
        os.environ['OPENAI_API_KEY'] = openai_api_key
        OpenAIclient = OpenAI(
            api_key=os.environ['OPENAI_API_KEY'],  # this is also the default, it can be omitted
        )
        print("OpenAI API Key assigned")
        llm_models.append("OpenAI")

        # Get the Hugging Face Inference Endpoint details
        service_name = "hf_llama2_inf_service"
        hf_api_key = 'incomplete'
        hf_llama2_inf_url = 'incomplete'
        print("Hugging Face API Key assigned")
        llm_models.append("Llama-2-7b-chat-hf")

        # Get the AI Core service and deployment details
        service_name = "ai-core-extended"
        ai_core_extended_service = env.get_service(name=service_name)
        ai_api_client = AIAPIV2Client(
            base_url=ai_core_extended_service.credentials['serviceurls']['AI_API_URL'] + "/v2",  # the present AI API version is 2
            auth_url=ai_core_extended_service.credentials['url'] + "/oauth/token",
            client_id=ai_core_extended_service.credentials['clientid'],
            client_secret=ai_core_extended_service.credentials['clientsecret'],
            resource_group='default'
        )
        aic_scenarios = [{"id": 'ollama-server'}, {"id": 'foundation-models'}]
        for scenario in aic_scenarios:
            deployments = ai_api_client.rest_client.get(
                path="/lm/deployments",
                headers={"AI-Resource-Group": "default"},
                params={"scenarioId": scenario['id']}
            )
            for deployment in deployments['resources']:
                # Use the findall method to extract the path
                matches = pattern.findall(deployment['deployment_url'])
                # The extracted path will be the first (and only) element in the matches list
                if matches:
                    scenario['path'] = matches[0]
                    llm_models.append("sap-ai-core-" + scenario['id'])
                else:
                    print("Path not found.")
    except cfenv.AppEnvError:
        print(f"The service '{service_name}' is not found.")
else:
    print("The app is not running in Cloud Foundry.")
prompt = " "
def generate_response(llm, prompt) -> Iterator[str]:
    outputs = []
    global ai_api_client
    global aic_scenarios
    if llm == 'OpenAI':
        response = OpenAIclient.chat.completions.create(
            model="gpt-3.5-turbo",
            messages=[
                {"role": "system", "content": "You are a helpful assistant."},
                {"role": "user", "content": prompt}
            ]
        )
        yield response.choices[0].message.content
    elif llm == 'Llama-2-7b-chat-hf':
        # Streaming client
        client = InferenceClient(hf_llama2_inf_url, token=hf_api_key)
        # Generation parameters
        gen_kwargs = dict(
            max_new_tokens=1024,
            top_k=50,
            top_p=0.95,
            temperature=0.8,
            stop_sequences=["\nUser:", "<|endoftext|>", "</s>"],
        )
        stream = client.text_generation(prompt, stream=True, details=True, **gen_kwargs)
        # Yield each generated token
        for r in stream:
            # Skip special tokens
            if r.token.special:
                continue
            # Stop if we encounter a stop sequence
            if r.token.text in gen_kwargs["stop_sequences"]:
                break
            # Yield the generated token
            print(r.token.text, end="")
            yield r.token.text
    # Ollama server expected to be running Phi-2
    elif llm == 'sap-ai-core-ollama-server':
        for scenario in aic_scenarios:
            if scenario["id"] == 'ollama-server':
                relevant_path = scenario["path"]
                break
        streaming = False
        response = ai_api_client.rest_client.post(
            path=relevant_path + "/v1/api/chat",
            headers={"AI-Resource-Group": "default"},
            body={
                "model": "phi",
                "messages": [
                    {"role": "user", "content": prompt}
                ],
                "stream": False
            }
        )
        if streaming:
            print('Streaming response logic not implemented')
        else:
            print(response)
            yield response['message']['content']
    # Azure OpenAI - gpt-35-turbo - api-version=2023-05-15
    elif llm == 'sap-ai-core-foundation-models':
        for scenario in aic_scenarios:
            if scenario["id"] == 'foundation-models':
                relevant_path = scenario["path"]
                break
        streaming = False
        response = ai_api_client.rest_client.post(
            path=relevant_path + "/chat/completions?api-version=2023-05-15",
            headers={"AI-Resource-Group": "default"},
            #resource_group="default",
            body={
                "messages": [
                    {"role": "user", "content": prompt}
                ],
                #"max_tokens": 100,  # gives error with api accepted in REST
                "temperature": 0.0,
                #"frequency_penalty": 0,  # gives error with api accepted in REST
                #"presence_penalty": 0,  # gives error with api accepted in REST
                "stop": "null",
                "stream": streaming
            }
        )
        if streaming:
            print('Streaming response logic not implemented')
        else:
            print(response)
            yield response["choices"][0]["message"]["content"]
    else:
        yield "Unknown LLM, please choose another and retry."
def llm_chatbot_function(llm, input, history) -> Iterator[list[tuple[str, str]]]:
    history = history or []
    my_history = [entry for sublist in history for entry in sublist]
    my_history.append(input)
    my_input = ' '.join(my_history)
    generator = generate_response(llm, my_input)
    try:
        first_response = next(generator)
        history.append((input, first_response))  # Ensure that the history entry is a tuple
        yield history, history
    except StopIteration:
        history.append((input, ''))  # Ensure that the history entry is a tuple
        yield history, history
    for chunk in generator:
        history[-1] = (history[-1][0], history[-1][1] + chunk)  # Ensure that the history entry is a tuple
        yield history, history
def create_llm_chatbot():
    with gr.Blocks(analytics_enabled=False) as interface:
        #if 1 == 1:
        with gr.Column():
            top = gr.Markdown("""<h1><center>LLM Chatbot</center></h1>""")
            llm = gr.Dropdown(
                llm_models, label="LLM Choice", info="Which LLM should be used?", value="OpenAI"
            )
            chatbot = gr.Chatbot()
            state = gr.State()
            question = gr.Textbox(show_label=False, placeholder="Ask me a question and press enter.")  #.style(container=False)
            with gr.Row():
                submit_btn = gr.Button("Submit")
                clear = gr.ClearButton([question, chatbot, state])
            question.submit(llm_chatbot_function, inputs=[llm, question, state], outputs=[chatbot, state])
            submit_btn.click(llm_chatbot_function, inputs=[llm, question, state], outputs=[chatbot, state])
    return interface
llm_chatbot = create_llm_chatbot()
if __name__ == '__main__':
    #llm_chatbot.launch(server_name='0.0.0.0', server_port=port)
    llm_chatbot.queue(max_size=20).launch(server_name='0.0.0.0', server_port=port)  # Queue needs share=True, debug=True
All going well, once you push the app to BTP Cloud Foundry, the updated Gradio app should now offer 4 LLMs:
Now let's test them.
First, let's test the more powerful foundation model (Azure OpenAI):
Perfect... or is it?
Now let's test the tiny custom LLM (Phi-2) running in SAP AI Core:
Correct? I'm not so sure... Perhaps it's read fewer books than OpenAI's LLM?
For more discussion of what the "correct" answer is, check out 2+2=5
For more comprehensive examples of how to build single-tenant and multitenant apps for customers and partners leveraging SAP AI Core and the latest SAP generative AI enhancements, check out:
Retrieval Augmented Generation with GenAI on SAP BTP
So where to next for this simple app? I think I heard somewhere that SAP HANA Cloud has a new vector engine designed to store embeddings, which can help improve LLM searches. 😉
I welcome your additional thoughts and comments below.
SAP notes that posts about potential uses of generative AI and large language models are merely the individual poster’s ideas and opinions, and do not represent SAP’s official position or future development roadmap. SAP has no legal obligation or other commitment to pursue any course of business, or develop or release any functionality, mentioned in any post or related content on this website.