
Welcome back to our series "SAP AI Core is All You Need" 😀.
Hey there! In this blog, let's dive into AI checkpointing – a common technique that acts as a safety net for your model training adventures. Imagine you're training a machine learning model, pouring in hours of computing power and heaps of data. Suddenly, bam! The GPUs hit their limit on the cluster, your boss tells you to stop because you're hogging all the resources, or there's simply a disaster in the data center. All that hard work – gone in an instant? Not with checkpointing!
In this blog, you will gain hands-on experience with the following key concepts:
What's Checkpointing, Anyway?

Checkpointing is like hitting "save" during a video game. It's a smart strategy that periodically saves the state of your model during training. This means capturing critical info like the model's weights, biases, and other parameters at different stages of the training journey.
Checkpointing isn't just a safety net; it's a game-changer. It saves valuable time and resources by allowing you to resume training from the last successful point. No need to redo everything from scratch. This can be a lifesaver for big, complex models that take days or weeks to train. It's also a powerful tool for keeping an eye on your model's progress. By saving checkpoints regularly, you can monitor how your model evolves over time. Spotting trends or issues early on can help you fine-tune your approach and achieve better results.
So, whether you're training a language model or teaching your AI to recognize cats from dogs, remember the magic of checkpointing. It's your safety net, your progress tracker, and your ticket to more efficient AI adventures. Happy checkpointing! 🚀
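To make this concrete before we get to SAP AI Core: in plain PyTorch, a checkpoint is typically just a dictionary holding the model weights, the optimizer state, and the current training step, written with torch.save. The snippet below is only a generic sketch – the function names, the checkpoint.pt path, and the surrounding objects are placeholder assumptions, not the code of our Shakespeare project:

import os
import torch

def save_checkpoint(model, optimizer, step, path='checkpoint.pt'):
    # Persist everything needed to resume: weights, optimizer state, and progress
    torch.save({
        'step': step,
        'model_state_dict': model.state_dict(),
        'optimizer_state_dict': optimizer.state_dict(),
    }, path)

def load_checkpoint(model, optimizer, path='checkpoint.pt'):
    # No checkpoint yet? Start from scratch at step 0.
    if not os.path.exists(path):
        return 0
    checkpoint = torch.load(path)
    model.load_state_dict(checkpoint['model_state_dict'])
    optimizer.load_state_dict(checkpoint['optimizer_state_dict'])
    return checkpoint['step'] + 1  # resume from the next iteration

Call save_checkpoint every N iterations and load_checkpoint once at startup, and an interrupted run picks up where it stopped instead of starting over.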
Alright, so we've established that checkpointing is a lifesaver for your AI projects. But here's the twist: we're going to take things a step further by implementing it in a separate Docker image.
Now, you might be wondering, "Why not just stick everything in one image?". That's a fair question! Honestly, the main answer is "to learn more"; a single image would work too, but keeping things separate offers some sweet advantages for your learning journey:
So, while keeping everything in one image might seem simpler at first, taking the "separate image" route offers a cleaner architecture, potential scalability, and a valuable learning experience on a robust platform like SAP BTP. Remember, the goal is to build robust and efficient AI models, and sometimes, a little extra planning goes a long way. So, embrace the power of separate images, and watch your machine learning projects thrive!
However, in a production environment and real-world scenarios, it's important to evaluate the most suitable approach based on your specific needs and resource constraints. Another consideration is whether this step can be integrated directly within the training workflow (using the same image, which is a more common approach) or omitted entirely (though not recommended for Language Models and Large Language Models).
If you feel confident with these steps already, feel free to skip ahead to the next blog, okay? I can still see you around! 👁 Alright, let's move forward then!
Let's dive into another interesting part of our machine learning project – defining what needs to be changed or adapted for our checkpointing setup.
As you might expect, the setup code needs to be adapted: before, it only provided the data input for our model, but this time it also needs to provide the language model in its last saved state and the BPE model (the tokenizer), right? Let's see what the setup's main.py looks like now:
from ShakespeareanGenerator.logger import Logger
from ShakespeareanGenerator.artifact_manager import ObjectStoreArtifactManager


class Run:

    def __init__(self):
        self.logging = Logger()
        self.obj = ObjectStoreArtifactManager()
        self.prepare_data()

    def prepare_data(self):
        self.logging.info('START: PREPARATION STEP')
        self.obj.upload_file_to_object_store()
        self.logging.info('Training Data was uploaded to Object Store')
        self.obj.copy_object(model_type='model')
        self.logging.info('The Language Model was successfully uploaded to the object store')
        self.obj.copy_object(model_type='bpe_model')
        self.logging.info('The trained tokenizer (BPE) was successfully uploaded to the object store')
        self.logging.info('END: PREPARATION STEP')


if __name__ == '__main__':
    Run()
Yes, it now calls the copy_object method twice, once for each input: model and bpe_model.
self.obj.copy_object(model_type='model')
self.logging.info('The Language Model was successfully uploaded to the object store')
self.obj.copy_object(model_type='bpe_model')
self.logging.info('The trained tokenizer (BPE) was successfully uploaded to the object store')
This is implemented by the ObjectStoreArtifactManager class from artifact_manager, so let's take a look at it:
import boto3
import requests
from ShakespeareanGenerator.logger import Logger
from ShakespeareanGenerator.parameters import ObjectStoreParameters


class ObjectStoreArtifactManager:

    def __init__(self):
        self.logging = Logger()
        self.obj_parameters = ObjectStoreParameters()
        self.s3 = self.__get_s3_connection()
        self.latest_execution_id = None

    def __get_s3_connection(self):
        return boto3.client(
            's3',
            aws_access_key_id=self.obj_parameters.access_key_id,
            aws_secret_access_key=self.obj_parameters.secret_access_key
        )

    def __get_executions(self):
        # Collect one entry per execution ID under prefix_m, ordered by LastModified
        response = self.s3.list_objects_v2(Bucket=self.obj_parameters.bucket_name, Prefix=self.obj_parameters.prefix_m)
        unique_prefixes = set()
        for obj in response['Contents']:
            prefix_part = obj['Key'].split('/')[2]
            unique_prefixes.add(prefix_part)
        sorted_objects = sorted(response['Contents'], key=lambda x: x['LastModified'])
        latest_keys = {}
        for obj in sorted_objects:
            prefix_part = obj['Key'].split('/')[2]
            if prefix_part not in latest_keys:
                latest_keys[prefix_part] = obj['Key'].split('/')[2]
        self.sorted_keys = list(latest_keys.values())

    def __check_model_files_exist(self):
        # Both the model/ and bpe_model/ folders must contain files for the execution to be usable
        model_files_exist = False
        for model_type in ['model', 'bpe_model']:
            source_key = f"{self.obj_parameters.prefix_m}{self.latest_execution_id}/{model_type}/"
            response = self.s3.list_objects_v2(Bucket=self.obj_parameters.bucket_name, Prefix=source_key)
            if 'Contents' in response:
                model_files_exist = True
            else:
                model_files_exist = False
                self.logging.warning('Exit the loop if any model file is missing')
                break
        return model_files_exist

    def __get_latest_valid_execution_id(self):
        # Walk the executions from newest to oldest until one has all model files
        if not hasattr(self, 'sorted_keys'):
            self.__get_executions()
            self.logging.info('Reading all the models in object store from all executions')
        if not hasattr(self, 'current_index'):
            self.current_index = 0
            self.logging.info(f'Initial Index: {self.current_index}')
        reversed_prefixes = list(reversed(self.sorted_keys))
        for index in range(len(self.sorted_keys)):
            self.latest_execution_id = reversed_prefixes[index]
            if self.__check_model_files_exist():
                return self.latest_execution_id
            else:
                msg = 'Files for execution ID not found. {}'.format(self.latest_execution_id)
                self.logging.warning(msg)

    def copy_object(self, model_type):
        # Copy the latest valid model/tokenizer files into the input folders used by the training step
        self.__get_latest_valid_execution_id()
        model_mappings = {
            'model': ('model.pkl', '{}{}model.pkl'.format(
                self.obj_parameters.prefix,
                self.obj_parameters.INPUT_MODEL_PATH
            )),
            'bpe_model_vocab': ('vocab.json', '{}{}vocab.json'.format(
                self.obj_parameters.prefix,
                self.obj_parameters.INPUT_BPE_MODEL_PATH
            )),
            'bpe_model_merges': ('merges.txt', '{}{}merges.txt'.format(
                self.obj_parameters.prefix,
                self.obj_parameters.INPUT_BPE_MODEL_PATH
            ))
        }
        if not any(key.startswith(model_type) for key in model_mappings):
            raise ValueError(f"Invalid model_type: {model_type}")
        for key, (model_file_name, destination_key) in model_mappings.items():
            if key.startswith(model_type):
                source_key = f"{self.obj_parameters.prefix_m}{self.latest_execution_id}/{model_type}/{model_file_name}"
                self.logging.info(f'FROM: {source_key} TO: {destination_key}')
                self.logging.info(f'Starting copy process for {model_type}')
                self.s3.copy_object(
                    Bucket=self.obj_parameters.bucket_name,
                    CopySource={'Bucket': self.obj_parameters.bucket_name, 'Key': source_key},
                    Key=destination_key
                )
        self.logging.info(f'{model_type} artifacts were updated from {self.latest_execution_id} folder to the input folders for further processing')
        return self.latest_execution_id

    def upload_file_to_object_store(self):
        # Fetch the Tiny Shakespeare corpus and upload it to the object store
        url = "<link_to_github_repository>tinyshakespeare.txt"
        file_key = f"{self.obj_parameters.prefix}{self.obj_parameters.DATA_PATH + self.obj_parameters.DATA_NAME}"
        try:
            response = requests.get(url)
            response.raise_for_status()  # Raise an exception for HTTP errors
            corpus = response.text
            corpus = "<b>".join(corpus.split('\n'))
            self.s3.put_object(
                Bucket=self.obj_parameters.bucket_name,
                Key=file_key,
                Body=corpus.encode('utf-8')
            )
            self.logging.info(f"Uploaded tinyshakespeare.txt to S3 path: {file_key}")
            self.logging.info(f"{self.obj_parameters.prefix_m}")
        except requests.RequestException as e:
            error_msg = f"Error fetching data from URL: {e}"
            print(error_msg)
            self.logging.error(error_msg)
        except Exception as e:
            error_msg = f"An unexpected error occurred: {e}"
            print(error_msg)
            self.logging.error(error_msg)
Alright, so we've got a class here called ObjectStoreArtifactManager. It manages our interactions with object storage – specifically Amazon S3 – within our machine learning workflow. Let's go through what each part of this class is doing.
Here's an example of how the code deals with the S3 bucket.
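To picture what __get_executions and __check_model_files_exist are working with, here's a small, hypothetical illustration of the key layout they expect (the path prefix is a placeholder, the layout is an assumption based on how the keys are split, and e6d11701c54a2597 is the execution ID we'll meet again later in this blog):

# Hypothetical keys under Prefix=prefix_m, assuming prefix_m resolves to '<path_prefix>/model/':
#   <path_prefix>/model/e6d11701c54a2597/model/model.pkl
#   <path_prefix>/model/e6d11701c54a2597/bpe_model/vocab.json
#   <path_prefix>/model/e6d11701c54a2597/bpe_model/merges.txt
example_key = '<path_prefix>/model/e6d11701c54a2597/model/model.pkl'
execution_id = example_key.split('/')[2]
print(execution_id)  # -> 'e6d11701c54a2597'

In other words, __get_executions collects one entry per execution ID (ordered by LastModified), and __get_latest_valid_execution_id walks them from newest to oldest until it finds one whose model/ and bpe_model/ folders both contain files.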
This ObjectStoreArtifactManager class is designed to provide all required interactions with object storage for our machine learning workflow, handling tasks such as copying model artifacts and uploading data files. It's all about managing our resources and keeping artifacts available for resuming training.
You might already have a good idea of what to expect from the files we'll be using, right? So, we'll focus on the specific ones that have been adapted or changed for our checkpointing feature. Let's start with main.py.
import pickle
import torch
from ShakespeareanGenerator.model.language_models import ShakespeareanLanguagelModel, ModelTrainer
from ShakespeareanGenerator.parameters import TrainingParameters
from ShakespeareanGenerator.data_handler import DataHandler
from ShakespeareanGenerator.logger import Logger


class Run:

    def __init__(self):
        self.logging = Logger()
        self.training_params = TrainingParameters()
        self.check_gpu_usage()
        self.prepare_data()
        self.train_model()

    def check_gpu_usage(self):
        if torch.cuda.is_available():
            self.logging.info(f"GPU is available, using GPU: {torch.cuda.get_device_name(0)}")
            self.logging.info(f"Using CUDA version {torch.version.cuda}")
        else:
            self.logging.warning("GPU is not available, using CPU.")

    def prepare_data(self):
        self.logging.info('START OF EXECUTION')
        self.logging.info('Get DataHandler and Model Instances')
        self.data_handler = DataHandler(self.training_params.DATA_PATH)
        try:
            # Try to resume from the model copied into the input_model artifact
            with open(self.training_params.INPUT_MODEL + 'model.pkl', 'rb') as f:
                loaded_model = pickle.load(f)
            self.logging.info('Loaded model for continuing training')
        except FileNotFoundError:
            loaded_model = None
            self.logging.error('Transfer learning not possible; no model found')
            self.logging.warning('Model will start from scratch')
        self.model_object = ShakespeareanLanguagelModel()
        model = self.model_object if loaded_model is None else loaded_model
        self.model = model.to(self.training_params.device)
        self.logging.info('DataHandler and Model Instantiated')

    def train_model(self):
        self.trainer = ModelTrainer(self.data_handler, self.model)
        self.trainer.train()
        self.logging.info('Model was trained successfully')
        with open(self.training_params.MODEL_PATH + 'model.pkl', 'wb') as f:
            pickle.dump(self.model, f)
        self.logging.info('END OF EXECUTION')


if __name__ == '__main__':
    Run()
As you can see, the changes are mainly related to this part, specifically:
try:
    with open(self.training_params.INPUT_MODEL + 'model.pkl', 'rb') as f:
        loaded_model = pickle.load(f)
    self.logging.info('Loaded model for continuing training')
except FileNotFoundError:
    loaded_model = None
    self.logging.error('Transfer learning not possible; no model found')
    self.logging.warning('Model will start from scratch')
self.model_object = ShakespeareanLanguagelModel()
model = self.model_object if loaded_model is None else loaded_model
Breaking it down, we have:
- A try block that loads model.pkl from the input_model artifact (training_params.INPUT_MODEL) so training can continue from the last saved state.
- A FileNotFoundError handler that logs the problem and falls back to loaded_model = None when no previous model exists.
- A final selection: a fresh ShakespeareanLanguagelModel is instantiated, and the loaded model is used instead whenever one was found, before being moved to the device.
Not that difficult, right? However, you might have noticed that training_params.INPUT_MODEL is a new parameter. If you did, you're correct! This is one of the new parameters that will be introduced soon. Hang tight, and we'll jump into it very quickly.
Since we've already trained the tokenizer previously, there's no need to train it again, especially if we're using the same dataset and not fine-tuning anything. Alright, let's break down another piece of code related to our machine learning project – the load_tokenizer method. This method is all about loading the previously trained tokenizer using the SentencePieceBPETokenizer.
def load_tokenizer(self):
    self.tokenizer = SentencePieceBPETokenizer.from_file(
        self.training_params.INPUT_TOKENIZER_MODEL + 'vocab.json',
        merges_filename=self.training_params.INPUT_TOKENIZER_MODEL + 'merges.txt')
Additionally, we have added self.load_tokenizer() in the encode and decode methods as well.
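The actual encode and decode implementations live in the repository; the snippet below is just a sketch of where that self.load_tokenizer() call fits, with assumed method bodies based on the tokenizers API:

def encode(self, text):
    self.load_tokenizer()                     # make sure the pre-trained BPE tokenizer is available
    return self.tokenizer.encode(text).ids    # token ids fed to the model

def decode(self, token_ids):
    self.load_tokenizer()
    return self.tokenizer.decode(token_ids)   # back to readable text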
And that's it! Can you believe it? Everything else remains unchanged. Well, we still need to review parameters.py, but there's not much to add to it. It's been pretty straightforward so far, right?
Moving forward, a few new parameters were added to parameters.py, mainly the paths of the new input artifacts – see the sketch below.
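The complete parameters.py is in the repository; a minimal sketch of what changes, assuming the same environment-variable style as in the previous blogs, could look like this (concrete values, the prefix_m layout, and anything not named elsewhere in this blog are assumptions):

import os
import torch

class TrainingParameters:
    def __init__(self):
        # Existing hyperparameters (BATCH_SIZE, CONTEXT_LENGTH, ...) keep coming from the env vars
        self.device = 'cuda' if torch.cuda.is_available() else 'cpu'
        self.DATA_PATH = '/app/data/'
        self.MODEL_PATH = '/app/model/'
        # New: where SAP AI Core mounts the input artifacts inside the container
        self.INPUT_MODEL = '/app/input_model/'                 # last saved language model (model.pkl)
        self.INPUT_TOKENIZER_MODEL = '/app/input_tokenizer/'   # BPE files (vocab.json, merges.txt)

class ObjectStoreParameters:
    def __init__(self):
        # Credentials come from the object-store-credentials secret (see the workflow template)
        self.bucket_name = os.environ.get('BUCKET_NAME')
        self.prefix = os.environ.get('PREFIX_NAME')
        self.access_key_id = os.environ.get('ACCESS_KEY_ID')
        self.secret_access_key = os.environ.get('SECRET_ACCESS_KEY')
        # New: prefixes used by the setup step when copying checkpoint artifacts (layout assumed)
        self.prefix_m = f'{self.prefix}model/'
        self.INPUT_MODEL_PATH = 'input_model/'
        self.INPUT_BPE_MODEL_PATH = 'input_tokenizer/'
        self.DATA_PATH = 'data/'
        self.DATA_NAME = 'tinyshakespeare.txt'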
Let's start by checking the Dockerfile for ai-core-checkpointer (no need to revisit ai-core-checkpointer-setup – it uses the same Dockerfile we already covered for ai-core-training-setup, right?):
# Use the PyTorch image with CUDA 12.1 and cuDNN 8 runtime
FROM pytorch/pytorch:2.2.2-cuda12.1-cudnn8-runtime
# Install necessary system dependencies
RUN apt-get update && apt-get install -y \
python3-pip \
&& apt-get clean \
&& rm -rf /var/lib/apt/lists/*
# Create necessary directories within the Docker image
RUN mkdir -p /app/src /app/data /app/input_model /app/input_tokenizer /app/model /app/logs
# Copy files from local system to path in Docker image
COPY main.py /app/src/
COPY requirements.txt /app/src/
COPY /ShakespeareanGenerator/*.py /app/src/ShakespeareanGenerator/
COPY /ShakespeareanGenerator/model/*.py /app/src/ShakespeareanGenerator/model/
# Install Python dependencies within the Docker image
RUN pip3 install --no-cache-dir -r /app/src/requirements.txt
# Set permissions to execute anything inside the /app folder
RUN chgrp -R 65534 /app && \
chmod -R 777 /app
If you've executed our training workflows from previous blogs, you might have noticed that the training Dockerfile uses a distinct base image: pytorch/pytorch:2.2.2-cuda12.1-cudnn8-runtime, a PyTorch image bundled with the CUDA 12.1 and cuDNN 8 runtime. That choice is all about GPU optimization: it keeps our training environment tuned for GPU-accelerated tasks, matching the performance demands of modern machine learning workflows and letting us take full advantage of the GPUs in the SAP AI Core Kubernetes cluster.
Now, let's focus on a couple of new directories in our setup. We've got some input directories that SAP AI Core will use during the workflow execution. These directories are set up as "input artifacts," meaning SAP AI Core will look for files there, but it won't copy anything into them - that's the job of the "setup" step we covered in the previous blog. We'll also walk you through the changes we made to the "setup" code so you can see what's happening behind the scenes.
RUN mkdir -p /app/data/
RUN mkdir -p /app/input_model/
RUN mkdir -p /app/input_tokenizer/
Next up, we've got a couple of new folders for our output. These directories are where SAP AI Core will automatically copy our outputs (thanks to the workflow template), and it will even create them if they don't exist. The copied files will be sent to the S3 "default" path.
RUN mkdir -p /app/model/
RUN mkdir -p /app/logs/
These simple mkdir commands are setting up our project structure to handle input data, models, and logs effectively within our Docker container. It's all about keeping things organized and ready for the workflow ahead.
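Putting those mkdir commands together, the layout inside the training container looks roughly like this (the input/output roles follow from the discussion above and from the workflow template coming next):

/app
├── src/              # main.py plus the ShakespeareanGenerator package
├── data/             # input artifact: Tiny Shakespeare dataset
├── input_model/      # input artifact: last saved language model (model.pkl)
├── input_tokenizer/  # input artifact: BPE tokenizer (vocab.json, merges.txt)
├── model/            # output artifact: newly trained model
└── logs/             # output artifact: training logs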
You can download the checkpointer_template.yml file from the GitHub repository:
apiVersion: argoproj.io/v1alpha1
kind: WorkflowTemplate
metadata:
  name: "shakespeare-model-chkp"
  annotations:
    scenarios.ai.sap.com/name: "shakespeare-language-model"
    scenarios.ai.sap.com/description: "Shakespeare Language Model"
    executables.ai.sap.com/name: "Shakespeare-language-model-trainer-checkpointer"
    executables.ai.sap.com/description: "Shakespeare Language Model Trainer Checkpointer Executable"
    artifacts.ai.sap.com/data.kind: "dataset"
    artifacts.ai.sap.com/data.description: "Tiny Shakespeare Dataset"
    artifacts.ai.sap.com/model.kind: "model"
    artifacts.ai.sap.com/model.description: "Trained Language Model"
    artifacts.ai.sap.com/model.labels: |
      {"ext.ai.sap.com/step":"train", "ext.ai.sap.com/version":"0.0.1"}
    artifacts.ai.sap.com/bpe_model.kind: "model"
    artifacts.ai.sap.com/bpe_model.description: "Byte-Pair Encoding Tokenizer"
    artifacts.ai.sap.com/bpe_model.labels: |
      {"ext.ai.sap.com/step":"train", "ext.ai.sap.com/version":"0.0.1"}
    artifacts.ai.sap.com/setuplogs.kind: "other"
    artifacts.ai.sap.com/setuplogs.description: "Setup Logs"
    artifacts.ai.sap.com/setuplogs.labels: |
      {"ext.ai.sap.com/step":"setup", "ext.ai.sap.com/version":"0.0.1"}
    artifacts.ai.sap.com/logs.kind: "other"
    artifacts.ai.sap.com/logs.description: "Model Training Logs"
    artifacts.ai.sap.com/logs.labels: |
      {"ext.ai.sap.com/step":"train", "ext.ai.sap.com/version":"0.0.1"}
  labels:
    scenarios.ai.sap.com/id: "shakespeare-language-model"
    executables.ai.sap.com/id: "shakespeare-checkpointer"
    ai.sap.com/version: "0.0.1"
spec:
  imagePullSecrets:
    - name: shakespeare-docker-repo
  entrypoint: core
  arguments:
    parameters:
      - name: BATCH_SIZE
        description: The number of training examples processed in one iteration during training. It determines the size of each batch in the training dataset.
      - name: CONTEXT_LENGTH
        description: Defines the maximum length of input sequences, typically representing the number of tokens in each sequence or block of text.
      - name: ITERATION_LIMIT
        description: Specifies the maximum number of iterations or training steps to be performed during the training process. It controls the duration of the training loop.
      - name: EVAL_FREQUENCY
        description: Indicates how often model evaluation occurs during training, measured in the number of iterations or epochs between evaluations.
      - name: EVAL_STEPS
        description: Represents the number of evaluation steps to perform during each evaluation period. It determines the granularity of evaluation within each evaluation cycle.
      - name: LEARNING_RATE
        description: The rate at which the model parameters are updated during training, influencing the size of the steps taken in the parameter space to minimize the loss function.
      - name: EMBEDDING_DIM
        description: Determines the dimensionality of the embedding vectors used to represent tokens in the model. It impacts the expressive power of the model's embedding layer.
      - name: ATTENTION_HEADS
        description: Specifies the number of parallel attention heads in the multi-head attention mechanism of the model. Each head learns different aspects of the input data.
      - name: NUM_LAYERS
        description: Represents the total number of transformer layers in the model architecture. It controls the depth and complexity of the model.
      - name: DROPOUT
        description: The probability of dropping out neurons or connections between layers during training, helping prevent overfitting by randomly deactivating some units.
      - name: DICTIONARY_SIZE
        description: Indicates the size of the vocabulary or dictionary used by the model, representing the total number of unique tokens or words in the dataset vocabulary.
  templates:
    - name: core
      steps:
        - - name: setup
            template: setup-pipeline
        - - name: train
            template: train-pipeline
    - name: setup-pipeline
      metadata:
        labels:
          ai.sap.com/resourcePlan: basic
      outputs:
        artifacts:
          - name: setup_logs
            globalName: setup_logs
            path: /app/logs/
            archive:
              none:
                {}
      container:
        image: docker.io/carlosbasto/shakespeare-checkpointer-setup:0.0.1
        imagePullPolicy: Always
        command: ["/bin/sh", "-c"]
        args:
          - python /app/src/main.py
        env:
          - name: BUCKET_NAME
            valueFrom:
              secretKeyRef:
                name: object-store-credentials
                key: bucket
          - name: PREFIX_NAME
            valueFrom:
              secretKeyRef:
                name: object-store-credentials
                key: path_prefix
          - name: ACCESS_KEY_ID
            valueFrom:
              secretKeyRef:
                name: object-store-credentials
                key: access_key_id
          - name: SECRET_ACCESS_KEY
            valueFrom:
              secretKeyRef:
                name: object-store-credentials
                key: secret_access_key
    - name: train-pipeline
      metadata:
        labels:
          ai.sap.com/resourcePlan: train.l
      inputs:
        artifacts:
          - name: data
            path: /app/data/
          - name: input_model
            path: /app/input_model/
          - name: input_tokenizer
            path: /app/input_tokenizer/
      outputs:
        artifacts:
          - name: model
            path: /app/model/
            globalName: model
            archive:
              none:
                {}
          - name: logs
            path: /app/logs/
            archive:
              none:
                {}
      container:
        image: docker.io/carlosbasto/shakespeare-checkpointer:0.0.1
        imagePullPolicy: Always
        command: ["/bin/sh", "-c"]
        args:
          - python /app/src/main.py
        env:
          - name: BATCH_SIZE
            value: "{{workflow.parameters.BATCH_SIZE}}"
          - name: CONTEXT_LENGTH
            value: "{{workflow.parameters.CONTEXT_LENGTH}}"
          - name: ITERATION_LIMIT
            value: "{{workflow.parameters.ITERATION_LIMIT}}"
          - name: EVAL_FREQUENCY
            value: "{{workflow.parameters.EVAL_FREQUENCY}}"
          - name: EVAL_STEPS
            value: "{{workflow.parameters.EVAL_STEPS}}"
          - name: LEARNING_RATE
            value: "{{workflow.parameters.LEARNING_RATE}}"
          - name: EMBEDDING_DIM
            value: "{{workflow.parameters.EMBEDDING_DIM}}"
          - name: ATTENTION_HEADS
            value: "{{workflow.parameters.ATTENTION_HEADS}}"
          - name: NUM_LAYERS
            value: "{{workflow.parameters.NUM_LAYERS}}"
          - name: DROPOUT
            value: "{{workflow.parameters.DROPOUT}}"
          - name: DICTIONARY_SIZE
            value: "{{workflow.parameters.DICTIONARY_SIZE}}"
Place it in your own GitHub repository and then, once you sync your application, you'll notice another file added.
Let's jump into the scenario with this new executable. Here, you'll find two executables: the trainer from the previous blog and the new checkpointer (shakespeare-checkpointer) we just registered.
In addition to the input parameters (which are the same as those for the trainer), we now have input and output artifacts to consider: the data, input_model, and input_tokenizer inputs, and the model and logs outputs defined in the template above.
Now that we have synced the scenario in SAP AI Core, let's walk through the steps to create the necessary artifacts and configure them for our execution scenario. If you need a refresher, you can check out the previous blog post for more clarity.
Are you back? Good, let’s create an artifact of type “model” for the scenario we had (shakespeare-language-model).
Now, we're going to create the BPE (Byte-Pair Encoding) model artifact. Give it a meaningful name that reflects its purpose.
Next, we'll need to specify the URL or path in S3 where the BPE model will be stored. This path should be set up during the initial workflow setup.
Since this BPE model is an input artifact for the checkpointer workflow, make sure that the corresponding folder exists in the object store (S3). The same goes for the input_model folder.
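If you'd like to verify those folders from code rather than the S3 console, a quick boto3 listing works. This is just a hedged helper, not part of the project's code; the bucket name and prefixes are placeholders you'd replace with your own values:

import boto3

s3 = boto3.client(
    's3',
    aws_access_key_id='<ACCESS_KEY_ID>',
    aws_secret_access_key='<SECRET_ACCESS_KEY>'
)

for folder in ['<path_prefix>/input_model/', '<path_prefix>/input_tokenizer/']:
    response = s3.list_objects_v2(Bucket='<bucket>', Prefix=folder)
    count = response.get('KeyCount', 0)
    status = f'{count} object(s) found' if count else 'missing - create it before running the workflow'
    print(f'{folder}: {status}')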
Feel free to add a label if you want – it can help organize and identify artifacts more easily. That's all for the tokenizer model setup!
Now let's repeat the process for the input_model:
Create the input_model artifact following similar steps. Once we've created these artifacts, we need to set up a configuration to map them to our specific scenario.
Next, map the inputs within the configuration to the corresponding artifacts we've just created.
After following these steps, you'll end up with a result similar to the one below, with extra inputs incorporated into the setup.
Now, you can clearly see that we've mapped 3 inputs to our configuration and obtained 2 outputs as results. Pretty neat, right? 😊
As expected, the checkpointer resumes from where the trainer left off: the training loss moves from 3.0334 to 3.035 and the validation loss from 3.3903 to 3.3866. It's nothing fancy, but it gives us a little more satisfaction knowing it worked. Of course, feel free to run it as many times as you want to try and achieve better results.
Anyway, let's check out what SAP AI Core has delivered to us in the S3 folders. This time, the execution ID is e6d11701c54a2597 in my case, so we should see a subfolder of ai://default/ with that ID after the execution is complete.
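If you prefer checking from code, the same kind of boto3 listing shows what landed under that execution ID (again, the bucket and prefix are placeholders, and the exact key layout depends on your setup):

import boto3

s3 = boto3.client(
    's3',
    aws_access_key_id='<ACCESS_KEY_ID>',
    aws_secret_access_key='<SECRET_ACCESS_KEY>'
)

response = s3.list_objects_v2(Bucket='<bucket>', Prefix='<path_prefix>/e6d11701c54a2597/')
for obj in response.get('Contents', []):
    print(obj['Key'], obj['LastModified'])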
There we have it! And inside it, we have the saved outputs.
Alright, great! I think we've come a long way. Now it's time to switch gears and start thinking about fine-tuning our model, don't you think?
See you in the next blog 😊.
Congratulations on mastering the essentials of checkpointing and resuming training for your Shakespearean Language Model! In this blog, we've explored critical aspects of setting up our training workflow using Docker and SAP AI Core.
Let's recap what we've covered:
- What checkpointing is and why it matters for long-running training jobs.
- Adapting the setup step (main.py and ObjectStoreArtifactManager) to copy the latest model and BPE tokenizer into the input folders.
- Adapting the training step to load a previously saved model and tokenizer and continue training from there.
- Building a dedicated Docker image and workflow template (checkpointer_template.yml) for the checkpointer.
- Creating the input artifacts, mapping them in a configuration, and running the execution on SAP AI Core.
Now that you've built, trained, and resumed your Shakespearean Language Model, it's time to dive deeper into the next topic: fine-tuning the model.