
Welcome back to our series "SAP AI Core is All You Need"!
In this blog, we're diving into the exciting world of deploying and serving AI models using SAP AI Core and KServe. Our focus? The legendary Shakespeare Language Model. If you're keen to explore how to bring advanced AI capabilities to life, join us as we build the infrastructure and the code necessary to deploy the Shakespeare Language Model for inference. To achieve this, we'll leverage the capabilities of SAP AI Core and the versatile Serving Template.
Let's make Shakespeare come alive in the world of artificial intelligence!
In this blog, you will gain practical insights into deploying and serving the Shakespeare Language Model for inference with SAP AI Core and KServe.
By the end of this blog, you'll understand the tools and resources used for serving models with SAP AI Core, and you'll be ready for the next blog, where we'll actually deploy the models.
This model is not your typical out-of-the-box solution; it relies on a complex architecture that includes custom classes and modules like TransformerBlock and FeedForward within PyTorch.
When we pickle and load our PyTorch model (let's call it model.pkl), it's not just the model's weights that get serialized. The entire structure, including these custom classes, is bundled together. This means that when you load the pickled model, your environment needs to have access to these original class definitions.
Why does it matter? Well, unlike scikit-learn models, which are largely self-contained during inference (they have dependencies too, just not in the same way), our PyTorch model relies on these custom components. Think of it like needing the original blueprint to rebuild a sophisticated machine. In PyTorch, you can save either the entire model or just the model's parameter dictionary (Saving and Loading Models). The recommended approach is to save only the model's state dictionary; however, I think it is a good homework exercise to try saving the whole model.
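As a quick illustration, here's a minimal sketch of the state_dict approach (the class name ShakespeareanLanguageModel and the file name are placeholders for this example, not the actual project code):

import torch

# assuming `model` is the trained instance of your custom class
torch.save(model.state_dict(), 'shakespeare_model_state.pt')

# To load, first rebuild the architecture from its class definition,
# which must be importable in the serving environment
model = ShakespeareanLanguageModel()
model.load_state_dict(torch.load('shakespeare_model_state.pt', map_location='cpu'))
model.eval()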
So, as you prepare to deploy your AI model/system, ensure that all the necessary custom classes and modules are available in your environment. This ensures that when you load and use your pickled PyTorch model, everything reconstructs nicely - just like the Bard's intricate prose 😀 .
While it's not a recommended practice for larger codebases, you can indeed serialize entire class definitions alongside your model object using libraries like cloudpickle. This approach can simplify deployment but requires careful management of dependencies.
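A hedged sketch of that cloudpickle route could look like the following (note that classes defined in regular importable modules are pickled by reference by default, so the exact behavior depends on where your classes live):

import pickle
import cloudpickle

# Serialize the whole model object; cloudpickle can also capture
# dynamically defined classes (e.g. from __main__) by value
with open('model.pkl', 'wb') as f:
    cloudpickle.dump(model, f)

# Loading back only requires the standard pickle module
with open('model.pkl', 'rb') as f:
    restored_model = pickle.load(f)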
We've finished discussing saving and loading models, so let's move on to the code breakdown for deploying our models. Yes! We now have two models, remember? The Shakespeare Language Model for text generation and the PeFT Model for Shakespeare style transfer.
By now, you're familiar with the setup - our custom classes and modules play a structural role in bringing our Shakespeare Language Model to life. Let's take a closer look at some key files that make this magic happen.
language_models.py & tokenizer.py: These files are essential for the unpickling process, ensuring that our model and tokenizer are reconstructed correctly. They hold the definitions of our custom classes and functions, so everything aligns perfectly during model loading.
logger.py: This file may seem minor, but it plays a critical role. We've made some tweaks here to enhance logging functionality, ensuring smooth operation and easier troubleshooting.
parameters.py: Here's where things get interesting. We've tweaked parameters to optimize our model's performance. Stay tuned for a deeper dive into these optimizations.
generator.py: A new addition to our toolkit! This file houses the code responsible for generating text samples from our language model. We'll explore how this generator interacts with our trained model to produce those elegant Shakespearean phrases.
main.py: The heart of our inference process. Let's dissect this file to understand how sampling from our language model is orchestrated. From loading the model to generating text, this is where the magic unfolds.
Together, these files form the backbone of our AI deployment. They encapsulate everything we need from our Shakespeare Language Model and pave the way for model inference. By the end, you'll have a clear picture of how our AI is deployed using SAP AI Core (and KServe behind the scenes).
Let's break down the essential components and files needed to create our API. Our journey begins with exploring the generator.py file.
import io
import torch
import pickle
from torch.nn import functional as F
from ShakespeareanGenerator.model.tokenizer import Tokenizer
from ShakespeareanGenerator.parameters import ServingParameters
from ShakespeareanGenerator.logger import Logger
Here, we start by importing the necessary dependencies for our text generation API deployment: io and pickle for deserializing the model, torch and torch.nn.functional for tensor operations and sampling, and our custom Tokenizer, ServingParameters, and Logger classes.
These imports set the stage for our text generation pipeline, enabling us to load the model, preprocess inputs, and manage serving configurations. Now, let's dive into the ModelManager class, a core component of our text generation API deployment.
class ModelManager:

    def __init__(self):
        self.model = None
        self.model_loaded = False
        self.serving_params = ServingParameters()
        self.logging = Logger()
        self.check_gpu_usage()

    def check_gpu_usage(self):
        if torch.cuda.is_available():
            self.logging.info(f"GPU is available, using GPU: {torch.cuda.get_device_name(0)}")
            self.logging.info(f"Using CUDA version {torch.version.cuda}")
        else:
            self.logging.warning("GPU is not available, using CPU.")

    def load_model(self):
        with open(self.serving_params.INPUT_MODEL, 'rb') as f:
            self.model = CPU_Unpickler(f).load()
        self.model.eval()
        self.model = self.model.to(self.serving_params.device)
        self.logging.info(f"Model loaded and sent to {self.serving_params.device}")
        self.model_loaded = True

    def is_model_loaded(self):
        return self.model_loaded
The ModelManager takes care of loading and managing the model, making sure our text generation API has everything it needs to create Shakespearean text with SAP AI Core.
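One detail worth highlighting: load_model calls a CPU_Unpickler helper (which is why io and pickle are imported). It isn't shown in the snippet above, but a common way to write such a helper, remapping tensors that were pickled on a GPU so they can also be loaded on a CPU-only machine, looks roughly like this:

class CPU_Unpickler(pickle.Unpickler):
    # Intercept torch storage deserialization and force tensors onto the CPU
    def find_class(self, module, name):
        if module == 'torch.storage' and name == '_load_from_bytes':
            return lambda b: torch.load(io.BytesIO(b), map_location='cpu')
        return super().find_class(module, name)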
Next, let's dive into the Generator class, which is responsible for the text generation process using our loaded language model. Let's break down how it works, including some key concepts like temperature, top_k, and top_p.
class Generator:

    def __init__(self, model_manager, max_tokens, temperature, top_k, top_p):
        self.max_tokens = max_tokens
        self.temperature = temperature
        self.top_k = top_k
        self.top_p = top_p
        self.tokenizer = Tokenizer()
        self.model_manager = model_manager

    def __sample_from_model(self, index):
        self.model = self.model_manager.model
        for _ in range(self.max_tokens):
            try:
                current_index = index[:, -self.model.position_embeddings.weight.shape[0]:]
                logits, _ = self.model(current_index)
                scaled_logits = (lambda l, t: l / t if t > 0.0 else l)(logits[:, -1, :], self.temperature)
                probs = F.softmax(scaled_logits, dim=-1)
                if self.top_k > 0:
                    probs_value, probs_indices = torch.topk(probs, self.top_k, dim=-1)
                    filtered_probs = probs.clone().fill_(0.0)
                    filtered_probs.scatter_(dim=-1, index=probs_indices, src=probs_value)
                    probs = filtered_probs / torch.sum(filtered_probs, dim=-1, keepdim=True)
                sorted_probs, sorted_indices = torch.sort(probs, descending=True)
                cumulative_probs = torch.cumsum(sorted_probs, dim=-1)
                sorted_indices_to_remove = cumulative_probs > self.top_p
                if torch.any(sorted_indices_to_remove):
                    cutoff_idx = torch.where(sorted_indices_to_remove)[1][0]
                    indices_to_remove = sorted_indices[:, cutoff_idx + 1:]
                    probs.scatter_(dim=-1, index=indices_to_remove, value=0.0)
                    probs = probs / torch.sum(probs, dim=-1, keepdim=True)
                next_index = torch.multinomial(probs, num_samples=1)
                index = torch.cat((index, next_index), dim=1)
            except Exception as e:
                self.model_manager.logging.error(f"Error during text generation: {str(e)}")
                raise
        return index

    def post_process_text(self, generated_text):
        cleaned_text = generated_text.replace("<s>", "").replace("</s>", "").replace("<b>", "").strip()
        return cleaned_text

    @torch.inference_mode()
    def generate(self):
        if not self.model_manager.is_model_loaded():
            self.model_manager.load_model()
        try:
            idx = torch.full((1, 1), 4, dtype=torch.long, device=self.model_manager.serving_params.device)
            completion = self.tokenizer.decode(self.__sample_from_model(idx)[0].tolist())
            self.length = len(self.tokenizer.encode(completion).ids)
            self.model_manager.logging.info(f"Text generated successfully with length: {self.length}")
            self.model_manager.logging.info(f"With max tokens set to: {self.max_tokens}")
            self.model_manager.logging.info(f"With temperature set to: {self.temperature}")
            self.model_manager.logging.info(f"With top k set to: {self.top_k}")
            self.model_manager.logging.info(f"With top p set to: {self.top_p}")
            return completion
        except Exception as e:
            self.model_manager.logging.error(f"Error during text generation: {str(e)}")
            raise
Parameters: max_tokens caps how many tokens are generated; temperature rescales the logits before sampling (lower values make the output more conservative, higher values more creative); top_k keeps only the k most likely tokens; and top_p (nucleus sampling) keeps the smallest set of tokens whose cumulative probability exceeds p.
Sampling: In the __sample_from_model method, the main idea is to apply these sampling techniques to generate text from the model, one token at a time.
Steps: For each new token, the context is cropped to the model's positional-embedding window, the model produces logits for the next token, the logits are scaled by the temperature and turned into probabilities with softmax, top_k and top_p filtering are applied, the filtered distribution is renormalized, and a token is drawn with torch.multinomial and appended to the sequence.
Post Processing: post_process_text removes special tokens such as <s>, </s>, and <b> and trims whitespace, so the output reads as clean text.
Generate: generate loads the model if it isn't loaded yet, starts from an initial token index, samples a completion under torch.inference_mode, decodes it with the tokenizer, and logs the generation settings.
The Generator class is designed to generate text using a pre-trained language model. It handles everything from setting up parameters and sampling tokens to cleaning up the text and logging the process. Cool, huh?
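To make it concrete, here's a quick local usage sketch (the parameter values are arbitrary examples):

from ShakespeareanGenerator.generator import Generator, ModelManager

model_manager = ModelManager()
generator = Generator(model_manager, max_tokens=50, temperature=0.8, top_k=40, top_p=0.9)
completion = generator.generate()  # loads the model on first use
print(generator.post_process_text(completion))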
Now it's time to define how the model will be consumed. One common approach is through API generation. APIs provide a flexible and standardized way to interact with the model, allowing various applications and users to access its functionality without needing to understand the underlying code. This approach is particularly useful for integrating the model into web services, mobile apps, or other systems that require real-time or on-demand text generation. And that's exactly what we'll do!
Before we jump into the code, let's introduce the main components that will be used: Flask to expose the HTTP endpoint, the Generator and ModelManager classes we just walked through, and ServingParameters and Logger for configuration and logging.
This Python script sets up a simple web server using Flask to generate text in the style of Shakespeare (you don't have to use Flask; feel free to use another Python web framework if you prefer). It makes use of our custom classes and modules, and here's how it works:
Imports and Setup
from flask import Flask, request, jsonify
from ShakespeareanGenerator.generator import Generator, ModelManager
from ShakespeareanGenerator.parameters import ServingParameters
from ShakespeareanGenerator.logger import Logger
Initialize Flask App
app = Flask(__name__)
app.json.sort_keys = False
Initialize Custom Classes
model_manager = ModelManager()
logging = Logger()
Load Model Before Handling Requests
def load_model():
    try:
        if not model_manager.is_model_loaded():
            model_manager.load_model()
        else:
            logging.info("Model already loaded")
    except Exception as e:
        logging.error(f"Error loading model: {str(e)}")
        raise

@app.before_request
def initialize():
    load_model()
Text Generation Endpoint
@app.route('/v2/generate', methods=["POST"])
def generate_text():
    data = request.get_json()
    max_tokens = int(data.get('max_tokens', 300))
    temperature = float(data.get('temperature', 1.0))
    top_k = int(data.get('top_k', 0))
    top_p = float(data.get('top_p', 0.9))
    generator = Generator(model_manager, max_tokens, temperature, top_k=top_k, top_p=top_p)
    generated_text = generator.generate()
    processed_text = generator.post_process_text(generated_text)
    lines = [line.strip() for line in processed_text.split('.') if line.strip()]
    response = {
        'generated_text': lines,
        'model_details': {
            'model_name': 'shakespeare-language-model',
            'temperature': generator.temperature,
            'length': generator.length,
            'top_k': generator.top_k,
            'top_p': generator.top_p,
        }
    }
    return jsonify(response)
In this part, we define an API endpoint ('/v2/generate') where you can send POST requests to trigger the text generation process. Simply include parameters like max_tokens, temperature, top_k, and top_p in the JSON body to customize the generated text.
Once the local server is up and running (thanks to app.run()), you can access your text generation API using the tool of your choice, such as curl:
curl -X POST http://localhost:9001/v2/generate -H "Content-Type: application/json" -d '{"max_tokens": 30, "temperature": 0.5, "top_k": 0, "top_p": 0.9}'
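If you prefer to stay in Python, an equivalent call with the requests library would be (assuming the server is reachable on localhost:9001):

import requests

payload = {"max_tokens": 30, "temperature": 0.5, "top_k": 0, "top_p": 0.9}
response = requests.post("http://localhost:9001/v2/generate", json=payload)
print(response.json()["generated_text"])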
Easy peasy! But let's talk a little bit more about testing it locally, which will be a very common practice in your experimentation cycles.
So, you've built your text generation app with Flask and you're eager to test it out before deploying it with SAP AI Core. No worries! Let's walk through how you can easily test your app locally using Docker and make quick fixes if needed.
To start testing locally, follow these steps:
STEP 1: Run the Local Image (Container)
Assuming you've built a Docker image for your Flask app, you can run it locally using Docker. Open your terminal and run:
docker run -p 9001:9001 -d your-image-name
This command starts a Docker container based on your image, mapping port 9001 of the container to 9001 on your localhost (-p 9001:9001). The -d flag runs the container in detached mode (in the background).
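If you haven't built the image yet, you can create it first from the folder containing your Dockerfile (the image name here is just an example, use whatever tag fits your project):

docker build -t your-image-name .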
STEP 2: Test your app
Open your web browser or use a tool like curl or Postman to send requests to your app running locally, just like the curl example above.
In addition, "GET" requests are used to retrieve data, while "POST" requests are used to send data to the server. For the text generation to work as intended, input parameters need to be sent, which is typically done via a "POST" request.
However, if you want to test the endpoint in the browser with a simple "GET" request (for testing purposes or to return some default generated text), you can add a "GET" method to the endpoint.
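A hedged sketch of that tweak could look like this (the rest of the handler stays exactly as shown earlier):

@app.route('/v2/generate', methods=["GET", "POST"])
def generate_text():
    # GET requests carry no JSON body, so fall back to an empty dict and the defaults below
    data = request.get_json(silent=True) or {}
    max_tokens = int(data.get('max_tokens', 300))
    temperature = float(data.get('temperature', 1.0))
    top_k = int(data.get('top_k', 0))
    top_p = float(data.get('top_p', 0.9))
    # ... the remaining generation and response code is unchanged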
Now you're all set to test and iterate on your text generation app locally. Feel free to experiment with different parameters, make changes to your code, and see the results in real-time.
In MLOps, maintaining observability and understanding your deployed models' behavior is key to success. Logging, as demonstrated in the Logger class below, plays a very important role in achieving these goals.
import logging
import boto3
import threading
import tempfile
from ShakespeareanGenerator.parameters import LogParameters
class Logger:
    def __init__(self):
        self.log_params = LogParameters()
        self.logger = logging.getLogger(__name__)
        self.logger.setLevel(logging.INFO)
        self.temp_file = tempfile.NamedTemporaryFile(mode='a', delete=False)
        self.file_handler = logging.FileHandler(self.temp_file.name)
        self.file_handler.setFormatter(logging.Formatter('%(asctime)s | %(name)s → %(levelname)s: %(message)s'))
        self.logger.addHandler(self.file_handler)
        self.s3 = self.__get_s3_connection()
        self.upload_logs_to_s3()

    def __get_s3_connection(self):
        return boto3.client(
            's3',
            aws_access_key_id=self.log_params.access_key_id,
            aws_secret_access_key=self.log_params.secret_access_key
        )

    def upload_logs_to_s3(self):
        try:
            # Read logs from the temporary file
            with open(self.temp_file.name, 'r') as f:
                log_body = f.read().strip()
            if log_body:
                file_key = self.log_params.log_prefix + self.log_params.LOG_NAME
                self.s3.put_object(
                    Bucket=self.log_params.bucket_name,
                    Key=file_key,
                    Body=log_body.encode('utf-8')
                )
            else:
                self.logger.info("No logs to upload.")
        except Exception as e:
            self.logger.error(f"Error uploading log to S3: {e}")
        # Reschedule the timer for the next upload
        self.schedule_next_upload()

    def schedule_next_upload(self):
        # Create a new timer for the next upload after the specified interval
        self.upload_timer = threading.Timer(self.log_params.upload_interval, self.upload_logs_to_s3)
        self.upload_timer.start()

    def log(self, level, message):
        getattr(self.logger, level)(message)

    def info(self, message):
        self.log('info', message)

    def warning(self, message):
        self.log('warning', message)

    def error(self, message):
        self.log('error', message)

    def critical(self, message):
        self.log('critical', message)
By leveraging logging effectively, as demonstrated by the Logger class, you can enhance the observability of your AI models in production. Remember, good logging practices are essential for maintaining reliable and performant MLOps workflows.
See you in the next blog 😉.
Congratulations on taking the first step into deploying AI models with SAP AI Core! In this blog, we explored how to bring the Shakespeare Language Model to life using SAP AI Core and KServe.
Let's recap what we've covered:
Introduction to SAP AI Core and KServe: We introduced the foundational concepts behind deploying and serving AI models using SAP AI Core and KServe.
Deploying AI Models: We learned the importance of integrating custom classes and modules, focusing on the unique architecture of the Shakespeare Language Model.
Code Breakdown: We explored critical files and their roles in making the Shakespeare Language Model work, including detailed explanations of key components like the generator and main files.
Building a Text Generation API: We set up and ran a Flask app to generate Shakespearean text, providing step-by-step instructions and practical examples.
Logging in MLOps: We understood the crucial role of logging for monitoring and troubleshooting in machine learning operations.
Now that we've laid the foundation for serving the AI models, stay tuned for the upcoming blogs in this series, where we'll explore how to deploy and enhance our model using SAP AI Core.