Welcome to the first installment of our series "SAP AI Core is All You Need"!
In this blog, titled "Building your Own Language Model with Transformers", we'll dive into the amazing capabilities of Transformers – the architecture that powers models like GPT (you can also check the first paper on that: Improving Language Understanding by Generative Pre-Training). We'll guide you through the process of building your own language model from scratch using this cutting-edge technology. Without further ado, let's get started!
In this blog, you will gain hands-on experience with the following key concepts:
Get ready to dive into a series of blog posts where we'll unlock the amazing potential of SAP AI Core and SAP AI Launchpad. We'll explore a rich collection of components and ideas to fuel your AI adventures. And here’s a fun fact: our series title is a playful nod to the groundbreaking paper "Attention is All You Need". Why? Because we'll be using those same principles to build our very own language model. Ready?
First, let's understand what "Transformers" are. The best place to start is the paper "Attention is All You Need", which we mentioned before, published by researchers at Google in 2017. This paper presented a novel approach to handling sequential data by replacing traditional recurrent neural networks (RNNs) and long short-term memory networks (LSTMs) with a fully attention-based model.
Now, we’ll dive into this architecture and show you how SAP AI Core can simplify building, training, and deploying these models on a Kubernetes cluster running on GPU. Exciting, right?
So, what are Transformers? They're a type of deep learning model designed to process sequential data, like text, more effectively than previous models such as RNNs and LSTMs. Transformers are exceptional at capturing long-range dependencies and context, making them well suited for tasks like translation, text generation, and language understanding. The magic lies in the attention mechanism, which lets the model dynamically focus on different parts of the input sequence. But, before we move forward, it is important to note that the transformer architecture can be used in different ways:
The image provides a visual comparison of three types of transformer architectures: Encoder-Only models like BERT, which are used for tasks such as sentiment analysis; Encoder-Decoder models like BART and T5, which are used for sequence-to-sequence tasks like translation; and Decoder-Only models like GPT, which are used for generative tasks and have an autoregressive property. The decoder-only variant is the one we'll be building from now on.
SAP AI Core is a handy service in the SAP Business Technology Platform that manages and runs your AI projects in a standardized, scalable way, without being tied to any specific cloud provider. It smoothly integrates with your SAP solutions, and you can easily use any AI function with open-source frameworks. SAP AI Core takes care of the full lifecycle management of AI scenarios. Plus, you can tap into generative AI capabilities and manage prompts via the generative AI hub. By leveraging SAP AI Core, you can speed up the entire process of building and managing transformers. Here’s how SAP AI Core can enhance your workflow:
Have you ever wondered why building AI models from scratch might not always be the best route, especially when there are fantastic pre-trained models readily available, like those you'll find on the SAP GenAI Hub? Let's discuss why creating your own language model can be an exciting and fulfilling experience. Here are some of the main reasons to consider building your own:
As we dive into building our Shakespearean language model, we've got some key classes in our toolkit that handle the most important tasks like attention, training, and logging metrics. We'll explore each class throughout this blog to understand how they compose a decoder-only Transformer (similar to the one used in GPT-2, as described in the paper "Language Models are Unsupervised Multitask Learners"). However, we need to start somewhere, right? And that starting point is the attention mechanism.
The “Scaled Dot-Product Attention” is the core of the attention mechanism that involves the computation of attention scores between query (Q) and key (K) vectors. The process can be broken down into the following steps:
Query, Key, and Value Matrices
Calculating Attention Scores
Softmax
Weighted Sum of Values
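Before looking at the implementation, here is a minimal, self-contained sketch of those four steps in plain PyTorch: the output is softmax(QKᵀ / √d_k) · V. The tensor sizes are purely illustrative; the Head class below adds the learned projections, the causal mask, and dropout.

import torch
import torch.nn.functional as F

# Toy sizes, purely illustrative: batch of 2, sequence length 5, key dimension 16.
B, T, d_k = 2, 5, 16
Q = torch.randn(B, T, d_k)   # queries
K = torch.randn(B, T, d_k)   # keys
V = torch.randn(B, T, d_k)   # values

scores = Q @ K.transpose(-2, -1) / d_k ** 0.5   # (B, T, T) scaled attention scores
weights = F.softmax(scores, dim=-1)             # normalize each row into attention weights
out = weights @ V                               # weighted sum of values -> (B, T, d_k)
print(out.shape)                                # torch.Size([2, 5, 16])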
import torch
import torch.nn as nn
import torch.nn.functional as F

from ShakespeareanGenerator.parameters import TrainingParameters

# Assumed module-level setup for language_models.py: one shared parameters instance
# (the environment variables read by TrainingParameters must already be set).
training_params = TrainingParameters()


class Head(nn.Module):

    def __init__(self, head_size):
        super().__init__()
        # Learned projections that turn the embeddings into keys, queries and values
        self.key = nn.Linear(training_params.embedding_dim, head_size, bias=False)
        self.query = nn.Linear(training_params.embedding_dim, head_size, bias=False)
        self.value = nn.Linear(training_params.embedding_dim, head_size, bias=False)
        # Lower-triangular mask: each position may only attend to itself and earlier positions
        self.register_buffer('tril', torch.tril(torch.ones(training_params.context_length, training_params.context_length)))
        self.dropout = nn.Dropout(training_params.dropout)

    def __compute_weights(self, x):
        B, T, C = x.shape
        k = self.key(x)
        q = self.query(x)
        weights = q @ k.transpose(-2, -1) * (k.shape[-1] ** -0.5)   # scaled dot-product scores
        weights = weights.masked_fill(self.tril[:T, :T] == 0, float('-inf'))   # hide future positions
        weights = F.softmax(weights, dim=-1)   # normalize scores into attention weights
        weights = self.dropout(weights)
        return weights

    def forward(self, x):
        weights = self.__compute_weights(x)
        v = self.value(x)
        out = weights @ v   # weighted sum of values
        return out
Well, that's it for the attention mechanism. If you need a more visual explanation, "Attention in transformers, visually explained | Chapter 6, Deep Learning" by 3Blue1Brown can help; it's a great source. Anyway, what's next? You might think that putting several heads to run in parallel would give the model the ability to "see from many perspectives", right? That's exactly what Multi-Head Attention does.
Multi-head attention involves running multiple attention mechanisms, or "heads," in parallel. Each head operates on a different linear projection of the input, and the results are concatenated and transformed to produce the final output. This allows the model to jointly attend to information from different representation subspaces at different positions.
Linear Projections
Scaled Dot-Product Attention
Concatenation and Linear Transformation
class MultiHeadAttention(nn.Module):

    def __init__(self, num_heads, head_size):
        super().__init__()
        # Several independent attention heads running in parallel
        self.heads = nn.ModuleList([Head(head_size) for _ in range(num_heads)])
        # Project the concatenated head outputs back to the embedding dimension
        self.projection = nn.Linear(head_size * num_heads, training_params.embedding_dim)
        self.dropout = nn.Dropout(training_params.dropout)

    def forward(self, x):
        head_outputs = [head(x) for head in self.heads]
        out = torch.cat(head_outputs, dim=-1)       # concatenate along the channel dimension
        out = self.dropout(self.projection(out))    # final linear transformation + dropout
        return out
Multi-head attention provides rich representations: each head focuses on different parts of the sequence and captures different aspects of the input data. The heads run in parallel, which keeps computation efficient, and the diverse perspectives tend to improve generalization across different tasks and datasets.
In the transformer architecture, each layer of the encoder and decoder contains a position-wise feed-forward network, applied independently to each position in the sequence.
class FeedForward(nn.Module):

    def __init__(self, embedding_dim):
        super().__init__()
        # Expand to 4x the embedding size, apply a non-linearity, then project back
        self.ffnet = nn.Sequential(
            nn.Linear(embedding_dim, 4 * embedding_dim),
            nn.ReLU(inplace=True),
            nn.Linear(4 * embedding_dim, embedding_dim),
            nn.Dropout(training_params.dropout)
        )

    def forward(self, x):
        return self.ffnet(x)
The feed-forward network (FFN) is also an important component of the transformer architecture, providing additional transformation and learning capacity ("time to think") to each layer. Unlike traditional RNNs, which process the sequence step by step through a recurrent hidden state, the FFN transforms each position independently and in parallel, applying the same weights at every position. This enhances the model's ability to capture complex patterns in the data.
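To make the "position-wise" part concrete, here is a small standalone check. It uses a plain nn.Sequential rather than the FeedForward class above (so it doesn't depend on training_params): because the MLP only touches the channel dimension, permuting the positions before or after applying it gives the same result.

import torch
import torch.nn as nn

# Standalone illustration: a position-wise MLP acts on the last (channel) dimension only,
# so every position is transformed with the same weights, independently of its neighbours.
mlp = nn.Sequential(nn.Linear(32, 128), nn.ReLU(), nn.Linear(128, 32))
x = torch.randn(2, 5, 32)                    # (batch, positions, channels)
perm = torch.randperm(5)                     # shuffle the positions
print(torch.allclose(mlp(x)[:, perm], mlp(x[:, perm]), atol=1e-6))   # True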
Each transformer block consists of a multi-head self-attention mechanism followed by a feed-forward network, with residual connections around each sub-layer and layer normalization applied before each sub-layer (the pre-norm arrangement, as also used in GPT-2).
class TransformerBlock(nn.Module):

    def __init__(self, embedding_dim, num_heads):
        super().__init__()
        head_size = embedding_dim // num_heads
        self.self_attn = MultiHeadAttention(num_heads, head_size)
        self.feed_forward = FeedForward(embedding_dim)
        self.layer_norm1 = nn.LayerNorm(embedding_dim)
        self.layer_norm2 = nn.LayerNorm(embedding_dim)

    def forward(self, x):
        # Pre-norm residual connections: normalize, apply the sub-layer, then add back the input
        attention_output = x + self.self_attn(self.layer_norm1(x))
        output = attention_output + self.feed_forward(self.layer_norm2(attention_output))
        return output
The transformer block integrates the key mechanisms that enable transformers to effectively process sequential data:
This class represents the full transformer model, which we call the ShakespeareanLanguageModel. It combines the embeddings, multiple transformer blocks, and the output layer.
class ShakespeareanLanguagelModel(nn.Module):

    def __init__(self):
        super().__init__()
        # Token and positional embeddings
        self.embeddings = nn.Embedding(training_params.dictionary_size, training_params.embedding_dim)
        self.position_embeddings = nn.Embedding(training_params.context_length, training_params.embedding_dim)
        # Stack of decoder-style transformer blocks
        self.transformer_blocks = nn.Sequential(
            *[TransformerBlock(training_params.embedding_dim, num_heads=training_params.attention_heads) for _ in range(training_params.num_layers)]
        )
        self.layer_norm = nn.LayerNorm(training_params.embedding_dim)
        # Maps the final hidden states back to vocabulary-sized logits
        self.output = nn.Linear(training_params.embedding_dim, training_params.dictionary_size)
        self.apply(self._init_weights)

    def _init_weights(self, module):
        if isinstance(module, nn.Linear) or isinstance(module, nn.Embedding):
            nn.init.normal_(module.weight, std=0.02)
            if hasattr(module, 'bias') and module.bias is not None:
                nn.init.zeros_(module.bias)

    def forward(self, index, targets=None):
        B, T = index.shape
        token_embeddings = self.embeddings(index)                 # (B, T, embedding_dim)
        position_embeddings = self.position_embeddings(torch.arange(T, device=index.device))
        x = token_embeddings + position_embeddings                # add positional information
        x = self.transformer_blocks(x)
        x = self.layer_norm(x)
        logits = self.output(x)                                   # (B, T, dictionary_size)
        if targets is None:
            loss = None
        else:
            # Flatten batch and time dimensions to compute the cross-entropy loss
            B, T, C = logits.shape
            logits = logits.view(B * T, C)
            targets = targets.view(B * T)
            loss = F.cross_entropy(logits, targets)
        return logits, loss
The full transformer model combines several key components to process and generate sequences effectively:
By integrating these components, the model can effectively understand and generate text, making it suitable for a variety of natural language processing tasks, including language modeling, translation, and text generation (our approach). The use of SAP AI Core further enhances the model's training and deployment capabilities, allowing for efficient handling of large-scale data and computational resources, as you'll see in the next blogs.
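One piece you won't find in the classes above is the sampling loop. As a hedged sketch (the helper below is hypothetical, not part of the project's codebase), autoregressive generation with this decoder-only model boils down to repeatedly feeding the last context_length tokens back in and sampling the next token from the softmax of the final position's logits:

import torch
import torch.nn.functional as F

@torch.no_grad()
def generate(model, index, max_new_tokens, context_length):
    # index: (B, T) tensor of token ids used as the starting context
    model.eval()
    for _ in range(max_new_tokens):
        cropped = index[:, -context_length:]              # never exceed the trained context window
        logits, _ = model(cropped)                        # forward pass without targets
        logits = logits[:, -1, :]                         # logits for the last position only
        probs = F.softmax(logits, dim=-1)                 # turn logits into a distribution
        next_token = torch.multinomial(probs, num_samples=1)   # sample one token id
        index = torch.cat((index, next_token), dim=1)     # append and continue
    return index

Decoding the returned ids with the tokenizer (introduced later in this post) turns them back into Shakespearean-looking text.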
Now that you’ve explored the decoder-only transformer model, named the "Shakespearean Language Model", you’re equipped with everything needed to generate Shakespearean text. Exciting, isn’t it? All that’s left is to walk through the implementation steps. Let’s get started!
Now that we've explored the potential of building our own language model, let's delve into the code! Python is our trusty companion on this journey, and its libraries offer powerful tools for working with AI and natural language processing, so you may need to install some of them, especially PyTorch.
Another thing to mention here is that we're creating transformers from scratch purely for educational purposes. In real-world scenarios, you can rely on libraries that offer many resources to speed up the development of your architectures such as Hugging Face, PyTorch or TensorFlow. However, we'll start from scratch so you can gain a better understanding of how everything works behind the scenes.
In our journey to explore the world of language modeling, we're excited to introduce the implementation of a Shakespearean language model using Python. This script, main.py, encapsulates the key steps involved in training and deploying our custom model. Now, we'll be discussing the code itself and the Transformers architecture. Some portions of this code will be revisited in upcoming blogs to clarify their usage and demonstrate their relevance to SAP AI Core. Let's go!
The main.py script serves as the backbone of our language modeling project, integrating various components to build and train the Shakespearean language model.
import pickle
import torch
from ShakespeareanGenerator.model.language_models import ShakespeareanLanguagelModel, ModelTrainer
from ShakespeareanGenerator.parameters import TrainingParameters
from ShakespeareanGenerator.data_handler import DataHandler
from ShakespeareanGenerator.logger import Logger


class Run:

    def __init__(self):
        self.logging = Logger()
        self.training_params = TrainingParameters()
        self.check_gpu_usage()
        self.prepare_data()
        self.train_model()

    def check_gpu_usage(self):
        if torch.cuda.is_available():
            self.logging.info(f"GPU is available, using GPU: {torch.cuda.get_device_name(0)}")
            self.logging.info(f"Using CUDA version {torch.version.cuda}")
        else:
            self.logging.warning("GPU is not available, using CPU.")

    def prepare_data(self):
        self.logging.info('START OF EXECUTION')
        self.logging.info('Get DataHandler and Model Instances')
        self.data_handler = DataHandler(self.training_params.DATA_PATH)
        self.model_object = ShakespeareanLanguagelModel()
        self.model = self.model_object.to(self.training_params.device)
        self.logging.info('DataHandler and Model Instantiated')

    def train_model(self):
        self.trainer = ModelTrainer(self.data_handler, self.model)
        self.trainer.train()
        self.logging.info('Model was trained successfully')
        with open(self.training_params.MODEL_PATH + 'model.pkl', 'wb') as f:
            pickle.dump(self.model, f)
        self.logging.info('END OF EXECUTION')


if __name__ == '__main__':
    Run()
Essentially, the model will require parameters for three main purposes: to make the model training pipeline "tunable" so that anyone can easily adjust its parameters and conduct further experiments, to establish the paths for input and output in the code, and to manage credentials for services such as SAP AI Object Store and SAP HANA. This class, defined in parameters.py, encapsulates the foundational parameters that drive the model training process.
# Import necessary libraries
import os
import torch


class TrainingParameters:

    def __init__(self):
        # Hyperparameters are read from environment variables so the pipeline stays tunable
        self.batch_size = int(os.environ.get('BATCH_SIZE'))
        self.context_length = int(os.environ.get('CONTEXT_LENGTH'))
        self.iteration_limit = int(os.environ.get('ITERATION_LIMIT'))
        self.eval_frequency = int(os.environ.get('EVAL_FREQUENCY'))
        self.eval_steps = int(os.environ.get('EVAL_STEPS'))
        self.learning_rate = float(os.environ.get('LEARNING_RATE'))
        self.embedding_dim = int(os.environ.get('EMBEDDING_DIM'))
        self.attention_heads = int(os.environ.get('ATTENTION_HEADS'))
        self.num_layers = int(os.environ.get('NUM_LAYERS'))
        self.dropout = float(os.environ.get('DROPOUT'))
        self.dictionary_size = int(os.environ.get('DICTIONARY_SIZE'))
        # Device selection and fixed paths used inside the container
        self.device = 'cuda' if torch.cuda.is_available() else 'cpu'
        self.DATA_PATH = '/app/data/tinyshakespeare.txt'
        self.MODEL_PATH = '/app/model/'
        self.TOKENIZER_MODEL_PATH = '/app/tokenizer/'
        self.LOG_PATH = '/app/logs/'
        self.LOG_NAME = 'train_logs.log'
With this class, our language modeling project gains enhanced configurability and efficiency, putting us one step closer to our goal.
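Since TrainingParameters reads everything from environment variables, here is a hedged sketch of what a quick local run could look like. The values below are illustrative guesses, not the ones used in this series; on SAP AI Core these parameters are supplied through the workflow configuration instead.

import os

# Illustrative values only; adjust to your hardware and experiment.
os.environ.update({
    'BATCH_SIZE': '32', 'CONTEXT_LENGTH': '256', 'ITERATION_LIMIT': '5000',
    'EVAL_FREQUENCY': '500', 'EVAL_STEPS': '100', 'LEARNING_RATE': '3e-4',
    'EMBEDDING_DIM': '384', 'ATTENTION_HEADS': '6', 'NUM_LAYERS': '6',
    'DROPOUT': '0.2', 'DICTIONARY_SIZE': '5000',
})

from ShakespeareanGenerator.parameters import TrainingParameters

params = TrainingParameters()
print(params.batch_size, params.context_length, params.device)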
This utility class, logger.py, simplifies the process of capturing and organizing important execution details, ensuring clarity and transparency throughout our development process. You could use standard, pre-built resources for logging, but we wanted a slightly more "understandable" approach here 😉.
import logging
from ShakespeareanGenerator.parameters import TrainingParameters


class Logger:

    def __init__(self):
        self.training_params = TrainingParameters()
        self.log_file = self.training_params.LOG_PATH + self.training_params.LOG_NAME
        logging.basicConfig(
            filename=self.log_file,
            filemode='w',
            format='%(asctime)s | %(name)s → %(levelname)s: %(message)s',
            level=logging.INFO
        )
        self.logger = logging.getLogger(__name__)

    def log(self, level, message):
        getattr(self.logger, level)(message)

    def info(self, message):
        self.log('info', message)

    def warning(self, message):
        self.log('warning', message)

    def error(self, message):
        self.log('error', message)

    def critical(self, message):
        self.log('critical', message)
Our language modeling project now gains visibility into execution progress and potential issues. The structured logging format ensures clarity and facilitates effective debugging during model development and training. Believe me, you'll need some good logs along the way.
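As a quick usage sketch (assuming the LOG_PATH directory exists and the environment variables behind TrainingParameters are set), the class is used much like the standard logging module, and every line follows the format configured above:

from ShakespeareanGenerator.logger import Logger

logging = Logger()
logging.info('Tokenizer training started')
logging.warning('GPU is not available, using CPU.')
# Each call lands in train_logs.log, roughly like:
# 2024-05-01 10:15:42,123 | ShakespeareanGenerator.logger → INFO: Tokenizer training started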
To create a strong Shakespearean language model (and any other language model, really 😁), handling data efficiently is key to getting the text ready to work with. Our DataHandler class, found in data_handler.py, is at the heart of managing these important data tasks for training and evaluating our language model. And for that, we also need to know our data, right?
Let's take a closer look at the Tiny Shakespeare dataset, a valuable resource for building language models within the SAP AI Core framework. The Tiny Shakespeare dataset comprises 40,000 lines of text drawn from a variety of Shakespeare's plays and was featured in Andrej Karpathy's blog post "The Unreasonable Effectiveness of Recurrent Neural Networks". It offers a manageable yet diverse collection of Shakespearean language, making it a practical choice for our case study.
Here are some captivating snippets of Shakespeare's texts from the dataset:
What Makes the Tiny Shakespeare Dataset Stand Out:
For further exploration, Andrej Karpathy offers an excellent breakdown of the Transformers architecture in his YouTube video "Let's build GPT: from scratch, in code, spelled out". Some aspects of the code we're discussing here resemble or are identical to what he demonstrates in the video, which can be immensely helpful for better comprehension of this whole blog.
import torch
from ShakespeareanGenerator.model.tokenizer import Tokenizer
from ShakespeareanGenerator.parameters import TrainingParameters
from ShakespeareanGenerator.logger import Logger


class DataHandler:

    def __init__(self, path):
        self.logging = Logger()
        self.training_params = TrainingParameters()
        self.path = path
        self.data = None

    def get_data(self):
        try:
            with open(self.path, 'r', encoding='utf-8') as file:
                self.data = file.read()
        except FileNotFoundError:
            msg = 'File {} not found.'.format(self.path)
            self.logging.error(msg)
            raise FileNotFoundError(msg)

    def get_batch(self, split):
        if self.data is None:
            self.get_data()
        tokenizer = Tokenizer(
            corpus=self.data,
            vocab_size=self.training_params.dictionary_size
        )
        encoded_corpus = tokenizer.encode(self.data)
        data = torch.tensor(encoded_corpus.ids, dtype=torch.long)
        # 90/10 split between training and validation data
        split_point = int(0.9 * len(data))
        training_set, validation_set = data[:split_point], data[split_point:]
        selected_data = training_set if split == 'train' else validation_set
        # Pick random starting points and build (input, target) pairs shifted by one token
        indices = torch.randint(len(selected_data) - self.training_params.context_length, (self.training_params.batch_size,))
        batches_x = []
        batches_y = []
        for index in indices:
            batch_x = selected_data[index:index + self.training_params.context_length]
            batch_y = selected_data[index + 1:index + self.training_params.context_length + 1]
            batches_x.append(batch_x)
            batches_y.append(batch_y)
        x = torch.stack(batches_x)
        y = torch.stack(batches_y)
        x, y = x.to(self.training_params.device), y.to(self.training_params.device)
        return x, y

    @torch.no_grad()
    def get_estimated_loss(self, model):
        out = {}
        model.eval()   # switch off dropout while estimating the loss
        for split in ['train', 'val']:
            losses = torch.zeros(self.training_params.eval_steps)
            for k in range(self.training_params.eval_steps):
                X, Y = self.get_batch(split)
                logits, loss = model(X, Y)
                losses[k] = loss.item()
            out[split] = losses.mean()
            self.logging.info('Estimated losses: {}'.format(losses.mean()))
        model.train()
        return out
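Here's a hedged usage sketch of the class above (it assumes the environment variables behind TrainingParameters are set and that the Tiny Shakespeare file exists at DATA_PATH): the targets are simply the inputs shifted by one token, which is what makes this a next-token-prediction setup.

from ShakespeareanGenerator.parameters import TrainingParameters
from ShakespeareanGenerator.data_handler import DataHandler

params = TrainingParameters()
data_handler = DataHandler(params.DATA_PATH)
x, y = data_handler.get_batch('train')
print(x.shape, y.shape)   # both (batch_size, context_length); y is x shifted one position to the right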
Before diving into the tokenizer code, let's talk about what tokenization is and how Byte Pair Encoding (BPE) works first.
Tokenization is the process of breaking down a text into smaller units called tokens. These tokens can be words, characters, or subwords. In natural language processing (NLP), tokenization is an essential preprocessing step because it converts the raw text into a format that a machine learning model can understand (check the image below).
There are a bunch of algorithms to tokenize text, ranging from simple ones like splitting by spaces or punctuation to more complex methods like WordPiece and SentencePiece. For example, basic tokenization might just split on whitespace, while WordPiece is used by models like BERT, and SentencePiece can create subword units even for languages with complex morphology. We'll be using Byte Pair Encoding (BPE) because it strikes a good balance between simplicity and effectiveness, handling rare words well by breaking them down into more frequent subwords. This makes it particularly useful for languages with rich vocabularies and for tasks where handling out-of-vocabulary words is important.
Byte Pair Encoding (BPE) is a tokenization technique that starts with the basic characters and iteratively merges the most frequent pairs of tokens. This way, it builds a vocabulary of subword units. BPE is particularly effective for handling rare words and out-of-vocabulary terms by breaking them into more frequent subwords.
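To see the idea in action, here is a tiny, self-contained sketch of a single BPE merge step (a toy illustration only, not the tokenizers library we'll actually use below): we count adjacent symbol pairs across the corpus and merge the most frequent one into a new token.

from collections import Counter

# Toy corpus: each word is a tuple of symbols together with its frequency.
vocab = {('l', 'o', 'w'): 5, ('l', 'o', 'w', 'e', 'r'): 2, ('n', 'e', 'w', 'e', 's', 't'): 6}

def most_frequent_pair(vocab):
    pairs = Counter()
    for word, freq in vocab.items():
        for a, b in zip(word, word[1:]):
            pairs[(a, b)] += freq
    return pairs.most_common(1)[0][0]

def merge_pair(vocab, pair):
    merged = {}
    for word, freq in vocab.items():
        new_word, i = [], 0
        while i < len(word):
            if i < len(word) - 1 and (word[i], word[i + 1]) == pair:
                new_word.append(word[i] + word[i + 1])   # fuse the pair into one symbol
                i += 2
            else:
                new_word.append(word[i])
                i += 1
        merged[tuple(new_word)] = freq
    return merged

pair = most_frequent_pair(vocab)    # ('w', 'e') is the most frequent pair in this toy corpus
vocab = merge_pair(vocab, pair)     # 'we' now behaves like a single subword token
print(pair, vocab)

Repeating this merge step until the vocabulary reaches the desired size is, in essence, what the BPE trainer does for us.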
Now, let's look at the code that implements tokenization using the SentencePieceBPETokenizer from the tokenizers library.
from tokenizers import SentencePieceBPETokenizer
from ShakespeareanGenerator.parameters import TrainingParameters


class Tokenizer:

    def __init__(self, corpus, vocab_size):
        training_params = TrainingParameters()
        self.TOKENIZER_MODEL_PATH = training_params.TOKENIZER_MODEL_PATH
        self.sentences = corpus.split('\n')
        self.vocab_size = vocab_size
        self.tokenizer = None

    def train_tokenizer(self):
        special_tokens = ["<pad>", "<unk>", "<s>", "</s>", "<b>"]
        self.tokenizer = SentencePieceBPETokenizer()
        # Learn the BPE merges directly from the corpus lines
        self.tokenizer.train_from_iterator(
            self.sentences,
            vocab_size=self.vocab_size,
            min_frequency=2,
            special_tokens=special_tokens,
            show_progress=False
        )
        self.tokenizer.save_model(self.TOKENIZER_MODEL_PATH)

    def encode(self, text):
        if not isinstance(text, str):
            raise TypeError('Input text must be a string.')
        try:
            if self.tokenizer is None:
                self.train_tokenizer()
            return self.tokenizer.encode(text)
        except Exception as e:
            print('Error occurred during encoding: {}'.format(e))
            raise

    def decode(self, tokens):
        if not isinstance(tokens, list):
            raise TypeError('Input tokens must be a list.')
        try:
            if self.tokenizer is None:
                self.train_tokenizer()
            return self.tokenizer.decode(tokens)
        except Exception as e:
            print('Error occurred during decoding: {}'.format(e))
            raise
With this Tokenizer class, you can easily manage tokenization using Byte Pair Encoding, making sure your text data is all set for further processing and model training. By getting a grasp on how tokenization and BPE work, you'll see why this preprocessing step is so foundational to Natural Language Processing (NLP) tasks.
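And here's a hedged round-trip sketch (it assumes the TrainingParameters environment variables are set, that the TOKENIZER_MODEL_PATH directory exists, and that a local copy of the corpus is available; the file path and vocab_size are just examples):

from ShakespeareanGenerator.model.tokenizer import Tokenizer

with open('tinyshakespeare.txt', 'r', encoding='utf-8') as f:   # example path, adjust to yours
    corpus = f.read()

tokenizer = Tokenizer(corpus=corpus, vocab_size=5000)
encoding = tokenizer.encode('To be, or not to be')   # returns a tokenizers Encoding object
print(encoding.ids)                                  # list of token ids
print(tokenizer.decode(encoding.ids))                # back to text via the learned subwords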
Now, let's take a closer look at the ModelTrainer class in our language modeling project. This class handles the training process, logs important info, and optimizes the model's parameters. We'll go over its key functions and see how they help make our project successful.
# This excerpt continues language_models.py, so `training_params` and the classes above are in
# scope. It also assumes a module-level `logging = Logger()` instance and the SAP AI Core SDK
# metrics API for `tracking`, `Metric` and `MetricCustomInfo` (an assumption here; the SAP AI Core
# integration is covered in detail in the next blogs), roughly:
# from datetime import datetime, timezone
# from ai_core_sdk.models import Metric, MetricCustomInfo
# from ai_core_sdk.tracking import Tracking
# tracking = Tracking()


class ModelTrainer:

    def __init__(self, data_handler, model):
        self.data_handler = data_handler
        self.model = model
        # Report the model size in millions of parameters, both to the logs and to SAP AI Core
        learning_parameters = sum(p.numel() for p in model.parameters()) / 1e6
        msg_to_log = 'The model is learning {} million parameters.'.format(learning_parameters)
        logging.info(msg_to_log)
        msg_to_metrics = '{} million parameters.'.format(learning_parameters)
        tracking.set_custom_info(
            custom_info=[
                MetricCustomInfo(name="Number of Parameters", value=str(msg_to_metrics))
            ]
        )
        self.optimizer = torch.optim.AdamW(
            self.model.parameters(), lr=training_params.learning_rate
        )

    def train(self):
        try:
            for iteration in range(training_params.iteration_limit):
                # Periodically estimate and report the training and validation losses
                if iteration % training_params.eval_frequency == 0 or iteration == training_params.iteration_limit - 1:
                    logging.info('Epoch {} started'.format(iteration))
                    losses = self.data_handler.get_estimated_loss(self.model)
                    evaluation_msg = 'EPOCH {} | LOSS: Train {:.4f} Valid {:.4f}'.format(
                        str(iteration).ljust(5), losses['train'], losses['val']
                    )
                    logging.info(evaluation_msg)
                    tracking.set_custom_info(
                        custom_info=[
                            MetricCustomInfo(name="Epoch Status", value=str(evaluation_msg))
                        ]
                    )
                    # Metric Logging: Step Information
                    training_loss_msg = '{:.4f}'.format(losses['train'])
                    validation_loss_msg = '{:.4f}'.format(losses['val'])
                    tracking.log_metrics(
                        metrics=[
                            Metric(
                                name="Training Loss",
                                value=float(training_loss_msg),
                                timestamp=datetime.now(timezone.utc),
                                step=iteration
                            ),
                            Metric(
                                name="Validation Loss",
                                value=float(validation_loss_msg),
                                timestamp=datetime.now(timezone.utc),
                                step=iteration
                            ),
                        ]
                    )
                # One training step: fetch a batch, forward pass, backward pass, optimizer update
                batches_x, batches_y = self.data_handler.get_batch('train')
                logging.info(f'Sent to Data Handler for Tokenization and Generating Batches for iteration {iteration}')
                logits, loss = self.model(batches_x, batches_y)
                logging.info(f'Forward Pass for iteration {iteration}')
                self.optimizer.zero_grad(set_to_none=True)
                loss.backward()
                logging.info(f'Backward Pass for iteration {iteration}')
                self.optimizer.step()
                logging.info(f'Optimization Step for iteration {iteration}')
        except Exception as e:
            logging.error(f'Training failed at iteration {iteration} with error: {e}')
            raise
Understanding and utilizing the ModelTrainer class is fundamental for effective model training and optimization: in our language modeling project, it drives the training iterations, manages the data, and monitors model performance. Feel free to adapt and explore further to suit your specific machine learning initiatives!
Well, I think we've covered enough for now. You've come a long way, and it's time to wrap up this blog and talk about the next steps. So, let's get to it.
Congratulations on making it this far in the transformer-based language modeling topic with the Tiny Shakespeare dataset! In this blog, we've explored the implementation of a language model using Transformers from scratch. Amazing work! 😄
Let's recap what we've covered:
Now that we've laid the foundation for language modeling, stay tuned for the upcoming blogs in this series, where we'll explore how to deploy and enhance our model using SAP AI Core: