NOTE: The views and opinions expressed in this blog are my own

Meta has recently made its Llama 2 Large Language Models (LLMs) available for free for commercial use: https://ai.meta.com/llama/

Though they may lack the breadth and depth of knowledge of an OpenAI LLM, the advantage of these open-source models is that they can run on more typical enterprise hardware and be fine-tuned (tailored) to your business needs.

 

Here's an example of the smallest Llama-2 model (7 billion parameters) running on Hugging Face and being consumed in BTP:




If you see the GIF running, it hasn't been sped up; that's how fast it is.

 

In a previous blog post I showed how easy it is to develop a basic chatbot on BTP for testing LLMs such as those provided by OpenAI:

Simplify your LLM Chatbot Demos with SAP BTP and Gradio

 

To extend the example further, it can easily be modified to work with other open-source LLMs.

 

The first step is to work out which LLM you wish to test, and Hugging Face is a great place to search.

Next you need to work out where you will host the model. Some typical options include:

  • Google Colab - Roll up your sleeves. Great for testing, but it can't 'officially' provide API access to a running model, though some advocate using ingress tools like ngrok (see the rough sketch after this list)

  • Hugging Face Inference Endpoints - super-simple deployment of an inference endpoint

  • AWS - relatively easy deployment with SageMaker; more flexibility, but more steps to follow

  • SAP AI Core - enterprise platform for AI solutions
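
For the Colab option, here is a rough, purely illustrative sketch of the ngrok idea mentioned above (not an officially supported setup). It assumes a GPU runtime, access to the gated Llama-2 repository on the Hub, an ngrok authtoken, and pip install transformers accelerate flask pyngrok; the port and payload shape are my own choices:

#colab_ngrok_sketch.py (illustrative only)
#---------------------------------------
import torch
from flask import Flask, request, jsonify
from pyngrok import ngrok
from transformers import pipeline

# Load the model in fp16 so the 7B weights fit on a single 16 GB GPU
generator = pipeline(
    "text-generation",
    model="meta-llama/Llama-2-7b-chat-hf",
    torch_dtype=torch.float16,
    device_map="auto",
)

app = Flask(__name__)

@app.route("/", methods=["POST"])
def generate():
    # Same payload shape as a Hugging Face Inference Endpoint: {"inputs": "..."}
    prompt = request.json["inputs"]
    return jsonify(generator(prompt, max_new_tokens=256))

public_url = ngrok.connect(5000)  # temporary public URL tunnelled into the notebook
print("Endpoint:", public_url)
app.run(port=5000)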


 

I went with Llama-2-7b-chat-hf and chose to deploy an Inference Endpoint:




 

 

You then need to choose your preferred cloud provider and instance size:




 

Follow the steps and deploy.   It's that easy.

If you wish to see the steps in full, you can check them out here.

 

So how much does it cost to run a tiny LLM 24/7 365 days a year?

As a rough estimate, you are looking at 5k - 12k USD per year, depending on whether you run on Hugging Face or deploy on your own hyperscaler.
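As a back-of-the-envelope check, assuming a single small GPU instance at roughly 0.60 - 1.30 USD per hour, the 8,760 hours in a year work out to approximately 5,300 - 11,400 USD, before any storage or network charges.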

 

After 5-10 minutes the instance should start, and you will be provided with an inference-only endpoint URL, as well as an example curl statement to test it.


 

e.g.
curl https://<your end point goes here> \
-X POST \
-d '{"inputs":"Whats is ABAP Cloud?"}' \
-H "Authorization: Bearer <your hf access token> \
-H "Content-Type: application/json"

Here's the result of a quick test:




 

Fantastic..... but hardly something end users can easily consume.

 

So next I made some minor tweaks to my earlier blog post example.

In BTP I added an additional user-provided service with the Hugging Face details:
cf cups hf_llama2_inf_service -p '{"url": "https://<Hugging Face Inference endpoint>", "api_key": "<hf access token>"}'


Next I modified requirements.txt and manifest.yaml, and added a little more code to server.py:
#requirements.txt   
#CHANGES: add huggingface library
#---------------------------------------
...
huggingface_hub



#manifest.yaml
#CHANGES: add service
#---------------------------------------
...
- hf_llama2_inf_service



#server.py
# CHANGES:
# - add Hugging Face functionality
# - get hf service credentials
# - add hf inference call logic
#---------------------------------------
...
from huggingface_hub import InferenceClient
...
...
# Get the Hugging Face credentials from the user-provided service
service_name = "hf_llama2_inf_service"
hf_api_key = env.get_service(name=service_name).credentials['api_key']
hf_llama2_inf_url = env.get_service(name=service_name).credentials['url']
print("Hugging Face API Key assigned")
...
...
elif llm == 'Llama-2-7b-chat-hf':

    # Streaming client for the Hugging Face Inference Endpoint
    client = InferenceClient(hf_llama2_inf_url, token=hf_api_key)

    # Generation parameters
    gen_kwargs = dict(
        max_new_tokens=1024,
        top_k=50,
        top_p=0.95,
        temperature=0.8,
        stop_sequences=["\nUser:", "<|endoftext|>", "</s>"],
    )

    stream = client.text_generation(prompt, stream=True, details=True, **gen_kwargs)

    # Yield each generated token
    for r in stream:
        # Skip special tokens
        if r.token.special:
            continue
        # Stop if we encounter a stop sequence
        if r.token.text in gen_kwargs["stop_sequences"]:
            break
        # Yield the generated token back to the chat stream
        print(r.token.text, end="")
        yield r.token.text
...
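
If you want to sanity-check the endpoint outside of BTP first, the same streaming call can be exercised from a standalone script. This is just a sketch; replace the placeholder URL and token with your own values:

#local_stream_test.py (sketch only)
#---------------------------------------
from huggingface_hub import InferenceClient

client = InferenceClient(
    "https://<Hugging Face Inference endpoint>",
    token="<hf access token>",
)

gen_kwargs = dict(
    max_new_tokens=256,
    top_k=50,
    top_p=0.95,
    temperature=0.8,
    stop_sequences=["\nUser:", "<|endoftext|>", "</s>"],
)

# Stream tokens back and print them as they arrive
for r in client.text_generation("What is ABAP Cloud?", stream=True, details=True, **gen_kwargs):
    if r.token.special:
        continue
    if r.token.text in gen_kwargs["stop_sequences"]:
        break
    print(r.token.text, end="", flush=True)
print()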




 

If you followed along, you should be able to test it yourself.

 

Did you get it to work? If so... congratulations!

 

What next?

Should I deploy it on SAP AI Core to better integrate with your enterprise data?

 

I welcome your comments and suggestions below.

 

SAP notes that posts about potential uses of generative AI and large language models are merely the individual poster’s ideas and opinions, and do not represent SAP’s official position or future development roadmap. SAP has no legal obligation or other commitment to pursue any course of business, or develop or release any functionality, mentioned in any post or related content on this website.
