Technology Blogs by SAP
Learn how to extend and personalize SAP applications. Follow the SAP technology blog for insights into SAP BTP, ABAP, SAP Analytics Cloud, SAP HANA, and more.
carlosbasto
Product and Topic Expert

Introduction

Welcome back to our series "SAP AI Core is All You Need" 😀.

In this blog, we'll continue our journey into the world of AI and Language Models using SAP AI Core and SAP AI Launchpad. Today, we're diving into an exciting topic: "Fine Tuning with Low-Rank Adaptation (LoRA)". We will explore the inner workings of fine-tuning pre-trained models, allowing them to improve in specific tasks, such as transforming modern English into Shakespearean prose.

What to Expect

In this blog, you will gain hands-on experience with the following key concepts:

  • Understanding Fine-Tuning: Learn what fine-tuning is and why it's a game-changer in machine learning.
  • Defining the Fine-Tuning Task: Discover how to set up a task for fine-tuning, specifically focusing on language style transfer to Shakespearean English.
  • Curating the Dataset: Understand the importance of a well-structured dataset and see examples of modern English sentences paired with their Shakespearean equivalents.
  • Parameter-Efficient Fine-Tuning (PEFT): Explore how PEFT can make fine-tuning large language models more efficient.
  • Low-Rank Adaptation (LoRA): Dive into LoRA, a technique that builds on PEFT to further optimize the fine-tuning process.
  • Implementing LoRA: Get a step-by-step guide to implementing LoRA in your models using PyTorch.
  • Deploying the Fine-Tuning Workflow: Learn how to deploy your fine-tuning workflow using SAP AI Core.

 

Adapting Pre-Trained Models for Specialized Domains

Fine-tuning is a game-changer when it comes to adapting pre-trained models for specific tasks. Imagine having a powerful model that's already learned a ton from vast amounts of data, and now you can tweak it to become a specialist in something new, like giving it a crash course in a specific domain.

Fine-tuning lets us take these pre-trained models, which are often trained on massive datasets for general tasks like image recognition or language understanding, and fine-tune them to get better in more specific areas. It's like taking a seasoned athlete and refining their skills for a particular sport.

In this blog post, we'll explore the ins and outs of fine-tuning – what it is, why it's important, and how it can supercharge your AI (and generative AI, of course) projects. Get ready to learn fine-tuning and level up your models!

Defining the Task for Fine-Tuning: Language Style Transfer

In our Transformers journey, one of the coolest tasks is language style transfer, often called text style transfer (TST). We're talking about taking modern English text and giving it that Shakespearean flair. This means fine-tuning an existing model to turn contemporary sentences into something that sounds like it came straight from the Bard himself, without losing the original meaning.

Our main goal is to fine-tune our Shakespeare model, currently a Shakespeare text generator, to handle language style transfer. Basically, we want to take everyday English sentences and transform them into Shakespearean dialect, capturing the rich language and rhythm of Elizabethan literature. Let’s give an example:

1.png

If you want, you can give it a try and see how our current Shakespeare language model does without fine-tuning:

2.png
Yeah, that’s not pretty! However, let’s be patient and wait for the fine-tuned model. It's also clear that our model needs much more training. The hyperparameters we've been using are meant to run quickly so you can move forward; if you want better results, you'll need to train it for longer. Don't worry, we'll talk about this later!

 

Defining the Dataset for the Fine-Tuning Task

To accomplish this linguistic transfer, we will leverage a curated dataset of modern English sentences paired with their Shakespearean counterparts. This dataset serves as our foundation, providing the necessary training examples for the model to learn and internalize the stylistic nuances of Shakespeare's language.

The dataset is composed of:

  • Modern English Sentences: Everyday phrases and expressions reflecting contemporary usage.
  • Shakespearean Equivalents: Corresponding sentences or phrases crafted in the stylistic flair and vocabulary reminiscent of Shakespearean prose. These may include translated excerpts or paraphrased adaptations tailored to evoke the essence of classical English literature.

Let’s take a look at some examples of this dataset:

1.png

The generated dataset contains 7425 lines covering different contexts and topics, each transferred from modern English to its Shakespearean equivalent. We've also provided an augmented fine-tuning dataset so you can go further with your experiments. Both files and the code for augmentation are available in the source repository (check the folder ai-core-tools 😉).

In our code, we've introduced a change to refine the instruct dataset for fine-tuning our model in Text Style Transfer. By opening the dataset file and reading its contents, we split the data into lines and then process each line to separate modern English sentences from their Shakespearean counterparts using a semicolon as the delimiter. For each pair of sentences, we format them by adding special tokens <ME> for the modern sentence, <STYLE_SHIFT> for the transition to the Shakespearean style, and <SK> to mark the end. These formatted sentences are then combined into a new dataset.

 

    def get_data(self):
        """Read 'modern;shakespearean' lines and wrap each pair in special tokens."""
        try:
            with open(self.path, 'r', encoding='utf-8') as file:
                data = file.read()
            lines = data.splitlines()
            formatted_sentences = []
            for line in lines:
                if not line.strip():
                    continue  # skip blank lines
                # Split on the first semicolon only, so semicolons inside the
                # Shakespearean text do not break the unpacking.
                modern, shakespearean = line.split(';', 1)
                formatted_sentence = f'<ME>{modern.strip()}<STYLE_SHIFT>{shakespearean.strip()}<SK>'
                formatted_sentences.append(formatted_sentence)
            self.data = formatted_sentences
        except FileNotFoundError:
            msg = 'File {} not found.'.format(self.path)
            self.logging.error(msg)
            raise FileNotFoundError(msg)

 

This change is promising for a few reasons. First, the addition of special tokens helps the model distinguish between different parts of the input, making it easier to learn the structure and nuances of the task. Second, by explicitly marking where the style shift occurs, we provide clear guidance to the model on how to transform the text. This level of specificity is particularly important in AI language models, as it enhances their ability to understand and generate text in distinct styles. Ultimately, these refinements improve the model's performance and accuracy in style transfer tasks, allowing it to produce more coherent and contextually appropriate outputs.
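To make the format concrete, here is a minimal sketch of how one dataset line could become a training example, and how a prompt might be built at inference time. The helper names format_pair, build_prompt, and extract_completion, as well as the sample sentence pair, are illustrative assumptions and not part of the original repository.

```python
ME, STYLE_SHIFT, SK = "<ME>", "<STYLE_SHIFT>", "<SK>"

def format_pair(line: str) -> str:
    """Turn 'modern;shakespearean' into one tagged training example."""
    # Split on the first semicolon only, so later semicolons stay in the target text.
    modern, shakespearean = line.split(";", 1)
    return f"{ME}{modern.strip()}{STYLE_SHIFT}{shakespearean.strip()}{SK}"

def build_prompt(modern: str) -> str:
    """Inference prompt: the model is expected to continue after <STYLE_SHIFT>."""
    return f"{ME}{modern.strip()}{STYLE_SHIFT}"

def extract_completion(generated: str) -> str:
    """Keep the text between <STYLE_SHIFT> and the first <SK>, if present."""
    after_shift = generated.split(STYLE_SHIFT, 1)[-1]
    return after_shift.split(SK, 1)[0].strip()

# Hypothetical sample line in the 'modern;shakespearean' layout described above.
example = format_pair("Where are you going? ; Whither goest thou?")
print(example)
print(build_prompt("Where are you going?"))
print(extract_completion(example))
```

At inference time the prompt stops right after <STYLE_SHIFT>, so generation can be cut off as soon as the model emits <SK>.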

In the upcoming stages of our exploration, we will delve deeper into the implementation of fine-tuning techniques and the transformative journey of our model as it evolves into a proficient language stylist. 🎭

 

Parameter-Efficient Fine-Tuning (PEFT)

Before diving into LoRA, let's explore Parameter-Efficient Fine-Tuning (PEFT). Fine-tuning, while powerful, can be computationally expensive due to the huge number of parameters involved in large language models. PEFT tackles this challenge by focusing on a core principle:

  • Not all parameters in a pre-trained LLM are equally important for every task.

PEFT capitalizes on this by strategically modifying only a small subset of parameters that are most relevant to the new task you're fine-tuning for. Imagine a superhero. Traditional fine-tuning might adjust every aspect of their abilities, even their super strength, which isn't necessarily relevant for rescuing kittens. PEFT, on the other hand, would focus on refining their agility and climbing, the skills most critical for the task.

Here's how PEFT achieves this:

  • Identifying Key Parameters: Techniques within PEFT help pinpoint the parameters in the pre-trained LLM that hold the most influence for the specific task.
  • Modifying Only the Relevant Ones: Once identified, PEFT focuses on adjusting only these key parameters, significantly reducing the overall computational cost.

Benefits of PEFT:

  • Faster Training: By focusing on a smaller subset of parameters, PEFT allows for faster fine-tuning compared to traditional methods.
  • Reduced Resource Consumption: With fewer parameters to modify, PEFT requires less computational power and memory, making it more efficient.

PEFT paves the way for LoRA:

PEFT establishes a foundation for even more efficient fine-tuning. LoRA, as we'll see next, builds upon this concept by introducing a special type of layer called a LoRA layer to further reduce the number of parameters needed for fine-tuning.

 

Low-Rank Adaptation (LoRA)

We've seen how Parameter-Efficient Fine-Tuning (PEFT) helps identify and modify the most relevant parameters in a pre-trained LLM for a specific task. Now, let's meet the champion of efficiency: Low-Rank Adaptation (LoRA).

LoRA takes PEFT to the next level by introducing a special type of layer called a "LoRA layer." Imagine these LoRA layers as specialized modules that can be strategically placed within the LLM architecture. Here's the magic:

  • Low-Rank Matrices: LoRA layers utilize low-rank matrices to capture the essential adjustments needed for the new task. Think of these matrices as efficient blueprints for the fine-tuning process.
  • Significantly Fewer Parameters: The beauty of LoRA lies in its ability to achieve effective fine-tuning with far fewer parameters compared to directly modifying the original LLM parameters. It's like achieving the same level of control over the superhero's agility and climbing with a much simpler set of adjustments.
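To put a rough number on "far fewer parameters", the sketch below compares updating a full weight matrix with training two low-rank factors instead. The layer size (4096 x 4096) and the rank (8) are hypothetical values chosen for illustration; they are not the settings used in this series.

```python
def full_update_params(d_in: int, d_out: int) -> int:
    """Trainable weights when the whole d_in x d_out matrix is updated."""
    return d_in * d_out

def lora_update_params(d_in: int, d_out: int, r: int) -> int:
    """Trainable weights for LoRA: A is d_in x r and B is r x d_out."""
    return r * (d_in + d_out)

d_in, d_out, r = 4096, 4096, 8  # hypothetical layer size and rank
full = full_update_params(d_in, d_out)
lora = lora_update_params(d_in, d_out, r)
print(f"full: {full:,}  lora: {lora:,}  ratio: {lora / full:.2%}")
```

For these assumed numbers, LoRA trains 65,536 weights instead of 16,777,216, which is about 0.39% of the full matrix.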

Benefits of LoRA:

  • Faster Training Times: Thanks to the reduced number of parameters involved, LoRA significantly accelerates the fine-tuning process.
  • Lower Memory Footprint: LoRA's efficiency allows you to fine-tune LLMs on devices with less memory, opening up deployment possibilities on various platforms.
  • Easier Deployment: LoRA fine-tuning produces a small set of adapter weights on top of the unchanged base model, which makes the result easier to store, share, and deploy into different environments.

LoRA doesn't require drastic changes to our existing LLM architecture. We can strategically insert LoRA layers into specific parts of the model to achieve fine-tuning for the desired task. We'll see this concept in action in the next section where we'll explore the code changes introduced with LoRA.

 

Implementing LoRA

While we're implementing LoRA (Low-Rank Adaptation) here ourselves, it's worth mentioning that there are fantastic libraries out there that can save you time and effort. We went this route because we also built our decoder-only transformer language model from scratch, so incorporating LoRA from the ground up fit our workflow. This isn't to say building it yourself is always the best approach! In fact, for most folks, using established libraries like Hugging Face Transformers or TensorFlow with Keras (both support LoRA) would be a much faster and more efficient way to go. These libraries provide pre-trained models, tested implementations of various techniques like LoRA, and a supportive community: a win-win! We're just taking the scenic route this time.

Introducing the LoRALayer

In the world of deep learning, optimizing large models with numerous parameters can be quite challenging. Enter LoRA, or Low-Rank Adaptation, a clever technique designed to make fine-tuning large language models more efficient. Let’s dive into how we can implement this using PyTorch with our custom LoRALayer.

Our LoRALayer class is a PyTorch module designed to introduce low-rank adaptation to existing linear transformations. Here's a step-by-step breakdown:

 

class LoRALayer(nn.Module):
    def __init__(self, in_dim, out_dim):
        super().__init__()
        std_dev = 1 / torch.sqrt(torch.tensor(lora_params.r).float())
        self.A = nn.Parameter(torch.randn(in_dim, lora_params.r) * std_dev)
        self.B = nn.Parameter(torch.zeros(lora_params.r, out_dim))
        self.scaling = lora_params.alpha / lora_params.r

    def forward(self, x):
        return self.scaling * (x @ self.A @ self.B)

 

  • Initialization:

    • Parameters:
      • in_dim and out_dim are the input and output dimensions of the layer.
      • lora_params.r (the rank) determines the size of the low-rank matrices, balancing efficiency against expressiveness.
      • lora_params.alpha is a scaling factor that adjusts the influence of the LoRA component. Both values are read from the shared lora_params configuration rather than passed to the constructor.
    • Initialization of Matrices:
      • self.A is initialized with random values scaled by a standard deviation of 1/sqrt(r).
      • self.B starts as a zero matrix, so the LoRA output is zero before any training. Both self.A and self.B are learnable parameters.
  • Forward Pass:

    • Matrix Multiplication:
      • The input x is multiplied by self.A and then by self.B. This sequence of matrix multiplications produces a low-rank update that is later added to the output of the original transformation.
    • Scaling:
      • The product is then scaled by self.scaling (alpha divided by the rank), controlling the impact of the LoRA transformation on the final output.
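Written as an equation, the forward pass of the LoRALayer looks as follows. The symbols d_in, d_out, r, and α are notation introduced here for illustration; they correspond to in_dim, out_dim, lora_params.r, and lora_params.alpha in the code.

```latex
\[
\mathrm{LoRA}(x) = \frac{\alpha}{r}\,(x A)\,B,
\qquad A \in \mathbb{R}^{d_{\mathrm{in}} \times r},
\quad B \in \mathbb{R}^{r \times d_{\mathrm{out}}}
\]
```

The rank r is typically chosen much smaller than d_in and d_out. Because B is initialized to zero, LoRA(x) = 0 at the start of fine-tuning, so the adapted layer initially behaves exactly like the pre-trained one.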

     

The LinearWithLoRA Class

 

class LinearWithLoRA(nn.Module):
    def __init__(self, linear):
        super().__init__()
        self.linear = linear
        self.lora = LoRALayer(linear.in_features, linear.out_features)

    def forward(self, x):
        return self.linear(x) + self.lora(x)

 

  • Initialization:

    • Parameters:
      • linear: The standard linear layer to be enhanced with LoRA.
      • rank and alpha: not passed here, because LoRALayer reads them from the global lora_params.
    • Initialization of Components:
      • self.linear is the original linear layer.
      • self.lora is an instance of the LoRALayer, initialized with the input and output features of the linear layer.
  • Forward Pass:

    • Combining Outputs:
      • The input x is first passed through the original linear layer (self.linear(x)).
      • Simultaneously, the input is passed through the LoRA layer (self.lora(x)).
      • The outputs of these two transformations are then summed to produce the final result. This combination allows the model to benefit from both the standard linear transformation and the efficient low-rank adaptation.
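The sketch below checks two properties of this wrapper on a toy layer: at initialization it reproduces the original linear layer, and after training the low-rank update can be folded back into a single weight matrix. To keep it self-contained, r and alpha are passed as constructor arguments instead of being read from lora_params; that change, the toy sizes, and the merge step are assumptions made for this example and are not part of the original code.

```python
import torch
import torch.nn as nn

class LoRALayer(nn.Module):
    # Same idea as the LoRALayer above, but r and alpha are explicit arguments
    # (an assumption made so this sketch runs on its own).
    def __init__(self, in_dim, out_dim, r=2, alpha=4.0):
        super().__init__()
        self.A = nn.Parameter(torch.randn(in_dim, r) / r ** 0.5)
        self.B = nn.Parameter(torch.zeros(r, out_dim))
        self.scaling = alpha / r

    def forward(self, x):
        return self.scaling * (x @ self.A @ self.B)

class LinearWithLoRA(nn.Module):
    def __init__(self, linear, r=2, alpha=4.0):
        super().__init__()
        self.linear = linear
        self.lora = LoRALayer(linear.in_features, linear.out_features, r, alpha)

    def forward(self, x):
        return self.linear(x) + self.lora(x)

torch.manual_seed(0)
base = nn.Linear(4, 2, bias=False)
wrapped = LinearWithLoRA(base)
x = torch.randn(3, 4)

with torch.no_grad():
    # B starts at zero, so the wrapper initially reproduces the base layer.
    same_at_init = torch.allclose(wrapped(x), base(x))

    # Pretend training changed B, then fold the update into one weight matrix.
    wrapped.lora.B.normal_()
    merged = nn.Linear(4, 2, bias=False)
    # nn.Linear stores its weight as (out, in), hence the transpose of A @ B.
    delta = wrapped.lora.scaling * (wrapped.lora.A @ wrapped.lora.B).T
    merged.weight.copy_(base.weight + delta)
    same_after_merge = torch.allclose(merged(x), wrapped(x), atol=1e-5)

print(same_at_init, same_after_merge)
```

Merging like this is optional; it simply shows that the sum of the two paths is equivalent to a single linear layer with an updated weight.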

 

Applying LoRA to the Transformer Layers

In the paper titled "LoRA: Low-Rank Adaptation of Large Language Models", the researchers discuss applying LoRA to specific weight matrices within a neural network to reduce the number of trainable parameters and improve efficiency. Specifically, they focus on the Transformer architecture, whose weight matrices sit in the self-attention and MLP (Multi-Layer Perceptron) modules.

Let's break down what this means. In a Transformer, there are several key weight matrices, such as Wq, Wk, Wv, and Wo in the self-attention module. These matrices handle the computations for attention mechanisms. The idea of LoRA is to keep these matrices frozen and learn a low-rank update for each of them, expressed as the product of two much smaller matrices. This lowers the number of trainable parameters while aiming to preserve performance.

From the paper “Low-rank Adaptation Method for Wav2vec2-based Fake Audio Detection”, we’ve adapted “Figure 1: Transformer architecture in wav2vec2 along with LoRA” to make it easier to understand what we are doing in the context of Transformers.

4.png

In transformer models, attention heads are the basis for capturing relationships between different parts of the input sequence. The HeadWithLoRA class integrates Low-Rank Adaptation (LoRA) into these attention heads to enhance efficiency and performance. Let’s explore how this class works.

Note: Section 4.2 of the paper "LoRA: Low-Rank Adaptation of Large Language Models" applies LoRA only to the attention weights; the MLP modules are frozen there for simplicity and parameter efficiency. However, recent and common practice also applies it to the MLP layers. Since this is a very small example, training the fully connected layers can be beneficial. It's up to you!

The HeadWithLoRA Class

The HeadWithLoRA class is designed to add LoRA components to the key, query, and value transformations within an attention head. Here’s how it works:

 

class HeadWithLoRA(nn.Module):
    def __init__(self, original_head):
        super().__init__()
        self.key = LinearWithLoRA(original_head.key)
        self.query = LinearWithLoRA(original_head.query)
        self.value = LinearWithLoRA(original_head.value)
        self.tril = original_head.tril
        self.dropout = nn.Dropout(training_params.dropout)

    def __compute_weights(self, x):
        B, T, C = x.shape
        k = self.key(x)
        q = self.query(x)
        weights = q @ k.transpose(-2, -1) * (k.shape[-1] ** -0.5)
        weights = weights.masked_fill(self.tril[:T, :T] == 0, float('-inf'))
        weights = F.softmax(weights, dim=-1)
        weights = self.dropout(weights)
        return weights

    def forward(self, x):
        weights = self.__compute_weights(x)
        v = self.value(x)
        return weights @ v

 

  • Initialization:

    • Key, Query, and Value Layers:
      • The class initializes key, query, and value layers using LinearWithLoRA. This means each of these layers is enhanced with a LoRA component.
      • These layers transform the input data into the key, query, and value vectors, respectively.
    • Lower Triangular Mask:
      • The lower triangular matrix tril is reused from the original head. This matrix is used to mask out future positions in the sequence, ensuring the model attends only to previous and current positions.
    • Dropout Layer:
      • A dropout layer is included to help prevent overfitting by randomly setting a fraction of input units to zero during training.
  • Computing Attention Weights:

    • Input Transformation:
      • The input x is passed through the key and query layers to produce the key (k) and query (q) vectors.
    • Scaled Dot-Product Attention:
      • The dot product of q and the transpose of k is computed and divided by the square root of the key dimension to stabilize gradients.
      • The resulting weights are masked using the lower triangular matrix to prevent attending to future positions.
      • Softmax is applied to normalize the weights, and dropout is applied to the normalized weights.
  • Forward Pass:

    • Attention Application:
      • The input x is passed through the value layer to produce the value (v) vectors.
      • The attention weights are then used to compute a weighted sum of the value vectors, producing the final output of the attention head.
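The masking and softmax steps can be seen on a toy example. This is a minimal sketch with made-up attention scores (all zeros) and a sequence length of 3, chosen only so the effect of the lower triangular mask is easy to read.

```python
import torch
import torch.nn.functional as F

T = 3
scores = torch.zeros(T, T)           # dummy attention scores, all equal
tril = torch.tril(torch.ones(T, T))  # ones on and below the diagonal

# Future positions (above the diagonal) get -inf, so softmax assigns them zero weight.
masked = scores.masked_fill(tril == 0, float("-inf"))
weights = F.softmax(masked, dim=-1)
print(weights)
```

Each row sums to 1, and row i spreads its weight only over positions 0 to i: the first token attends to itself alone, while the last token attends uniformly to all three positions.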

The MultiHeadAttentionWithLoRA Class

The MultiHeadAttentionWithLoRA class is designed to add LoRA components to a multi-head attention mechanism. Here’s how it works:

 

class MultiHeadAttentionWithLoRA(nn.Module):
    def __init__(self, original_mha):
        super().__init__()
        self.heads = nn.ModuleList([HeadWithLoRA(head) for head in original_mha.heads])
        self.projection = LinearWithLoRA(original_mha.projection)

    def forward(self, x):
        head_outputs = [head(x) for head in self.heads]
        out = torch.cat(head_outputs, dim=-1)
        out = self.projection(out)
        return out

 

  • Initialization:

    • Attention Heads:
      • The class initializes multiple attention heads using HeadWithLoRA, each enhanced with LoRA. This is done by creating a list of HeadWithLoRA instances, one for each attention head.
    • Projection Layer:
      • After the attention heads, a projection layer is initialized using LinearWithLoRA. This layer combines the outputs of all attention heads and projects them back to the desired output dimension.
  • Forward Pass:

    • Computing Head Outputs:
      • The input x is passed through each attention head. The outputs from all heads are collected in a list.
    • Concatenating Head Outputs:
      • The outputs from all attention heads are concatenated along the last dimension, forming a single tensor that aggregates the information from all heads.
    • Projecting Output:
      • This concatenated tensor is then passed through the projection layer to produce the final output of the multi-head attention mechanism.
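A quick shape check illustrates the concatenation step. The sketch uses random tensors in place of real head outputs; the batch size of 1 is an assumption, while the context length (16), the number of heads (2), and the head size (2) follow the small model printed later in this post.

```python
import torch

B, T = 1, 16               # batch size (assumed) and context length
n_heads, head_size = 2, 2  # 2 heads of size 2, giving an embedding size of 4

# Stand-ins for the per-head attention outputs.
head_outputs = [torch.randn(B, T, head_size) for _ in range(n_heads)]

# Concatenating along the last dimension restores the embedding size.
out = torch.cat(head_outputs, dim=-1)
print(out.shape)
```

The concatenated tensor has shape (1, 16, 4), which matches the in_features of the projection layer.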

The FeedForwardWithLoRA Class

The FeedForwardWithLoRA class is designed to add LoRA components to the feedforward neural networks within transformer models. Here’s how it works:

 

class FeedForwardWithLoRA(nn.Module):
    def __init__(self, original_ffn):
        super().__init__()
        self.ffnet = nn.Sequential(
            LinearWithLoRA(original_ffn.ffnet[0]),
            original_ffn.ffnet[1],
            LinearWithLoRA(original_ffn.ffnet[2]),
            original_ffn.ffnet[3]
        )

    def forward(self, x):
        return self.ffnet(x)

 

  • Initialization:

    • Feedforward Network:
      • The class initializes a feedforward network using nn.Sequential to stack multiple layers. This network consists of:
        • First Linear Layer with LoRA:
          • A linear layer that expands the input dimension (embedding_dim) to four times its size. This layer is enhanced with LoRA.
        • ReLU Activation:
          • A ReLU activation function is applied in place to introduce non-linearity and enable the model to learn complex patterns.
        • Second Linear Layer with LoRA:
          • Another linear layer that projects the expanded dimension back to the original embedding_dim. This layer is also enhanced with LoRA.
        • Dropout Layer:
          • A dropout layer is included to help prevent overfitting by randomly setting a fraction of input units to zero during training.
  • Forward Pass:

    • Processing Input:
      • The input x is passed through the feedforward network (self.ffnet), which sequentially applies the first linear layer with LoRA, the ReLU activation, the second linear layer with LoRA, and the dropout layer.
    • Returning Output:
      • The processed output is returned as the final result of the feedforward network.

 

Introducing the PEFTModel Class

Fine-tuning large-scale pre-trained transformer models efficiently is a significant challenge. The PEFTModel class (Parameter-Efficient Fine-Tuning Model) integrates Low-Rank Adaptation (LoRA) into pre-trained transformer models, making the fine-tuning process more efficient. Let’s dive into how this class works.

The PEFTModel Class

The PEFTModel class is designed to enhance a pre-trained transformer model with LoRA components. Here’s how it works:

 

class PEFTModel(nn.Module):
    def __init__(self, pretrained_model):
        super().__init__()
        self.pretrained_model = pretrained_model
        for i, block in enumerate(self.pretrained_model.transformer_blocks):
            self.pretrained_model.transformer_blocks[i].self_attn = MultiHeadAttentionWithLoRA(block.self_attn)
            
            # According to Section 4.2 of the paper, LoRA is only applied to the attention weights.
            # However, recent and common practice also applies it to the MLP layers.
            # If you do not want to do so, comment out this line:
            self.pretrained_model.transformer_blocks[i].feed_forward = FeedForwardWithLoRA(block.feed_forward)

        self.freeze_non_lora_layers(self.pretrained_model)
        
    def forward(self, index, targets=None):
        logits, loss = self.pretrained_model(index, targets)
        return logits, loss

 

    • Initialization:

      • Input: Takes a pretrained_model, which is a transformer model that has already been trained on a large dataset.

      • Super Initialization: Calls the parent class (nn.Module) constructor to initialize the PyTorch module.

      • Store Pretrained Model: Stores the given pre-trained model in the class instance.

      • Iterate Over Transformer Blocks: Loops through each transformer block in the pre-trained model:

        • Replace Self-Attention Layer: Replaces the existing self-attention layer with a MultiHeadAttentionWithLoRA layer, which incorporates LoRA.
        • Replace Feed-Forward Layer: Replaces the existing feed-forward layer with a FeedForwardWithLoRA layer, also incorporating LoRA.

         

      •  Freeze Non-LoRA Layers: Calls the freeze_non_lora_layers method to freeze parameters that are not part of the LoRA layers.
  • Forward Method:

    • Input: Takes index (input data) and targets (optional, usually for supervised learning).

    • Output: Returns the logits (model predictions) and loss (if targets are provided).

    • Functionality: Passes the input data through the pre-trained model, which has been modified to include LoRA layers.

The PEFTModel class modifies a pre-trained transformer model by incorporating LoRA layers, replacing the self-attention and feed-forward layers with LoRA-enhanced versions, and freezing the original model's parameters to make fine-tuning more computationally efficient while still allowing the model to adapt to new tasks. Wasn't that hard, right?

Freezing and Unfreezing the Grad

Since we're using LoRA, our strategy is to freeze every parameter that is neither part of a LoRA layer nor part of an MLP (feed-forward) layer. LoRA is applied to both the attention and feed-forward layers, and only the LoRA parameters plus the feed-forward weights remain trainable, allowing us to focus our training efforts on the most relevant parts of the model.

 

    def freeze_non_lora_layers(self, module):
        for name, param in module.named_parameters():
            # In the original paper, the MLP layers are frozen, both for simplicity
            # and for parameter efficiency. However, as this is a very small example,
            # training the fully connected layers was beneficial, so they stay trainable here.
            # To freeze them as in the paper, use `if "lora" not in name:` instead.
            if "lora" not in name and ".feed_forward.ffnet." not in name:
                param.requires_grad = False
            else:
                param.requires_grad = True

 

  • Freeze Non-LoRA Layers Method:

    • Input: Takes a module, which in this case is the pre-trained model.

    • Functionality: Iterates over all named parameters in the model:

      • Check Parameter Names: If the parameter name contains neither "lora" nor ".feed_forward.ffnet.", it sets requires_grad to False, freezing the parameter (preventing it from being updated during training).

      • Enable Gradients for LoRA and Feed-Forward Layers: Otherwise, it sets requires_grad to True, allowing the parameter to be updated during training.
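One way to verify that freezing worked is to count trainable parameters before handing them to an optimizer. The sketch below uses a hypothetical stand-in module (TinyLoRALinear, with a fixed rank of 1) and a simplified name check that keeps only parameters containing "lora" trainable. It is not the model from this post, only an illustration of the counting pattern.

```python
import torch
import torch.nn as nn

class TinyLoRALinear(nn.Module):
    # Hypothetical stand-in for LinearWithLoRA, with a fixed rank of 1.
    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.linear = nn.Linear(in_dim, out_dim, bias=False)
        self.lora_A = nn.Parameter(torch.randn(in_dim, 1))
        self.lora_B = nn.Parameter(torch.zeros(1, out_dim))

    def forward(self, x):
        return self.linear(x) + x @ self.lora_A @ self.lora_B

def freeze_non_lora(module):
    # Simplified check: only parameters whose name contains "lora" stay trainable.
    for name, param in module.named_parameters():
        param.requires_grad = "lora" in name

def count_params(module):
    total = sum(p.numel() for p in module.parameters())
    trainable = sum(p.numel() for p in module.parameters() if p.requires_grad)
    return total, trainable

model = nn.Sequential(TinyLoRALinear(4, 16), nn.ReLU(), TinyLoRALinear(16, 4))
freeze_non_lora(model)
total, trainable = count_params(model)
print(f"trainable {trainable} of {total} parameters")
```

In this toy setup, 40 of 168 parameters remain trainable. When building the optimizer, it is common to pass only the parameters with requires_grad set to True, so the frozen weights are skipped entirely.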

Take a look at the Pre-Trained model:

 

ShakespeareanLanguagelModel(
  (embeddings): Embedding(1024, 4)
  (position_embeddings): Embedding(16, 4)
  (transformer_blocks): Sequential(
    (0): TransformerBlock(
      (self_attn): MultiHeadAttention(
        (heads): ModuleList(
          (0-1): 2 x Head(
            (key): Linear(in_features=4, out_features=2, bias=False)
            (query): Linear(in_features=4, out_features=2, bias=False)
            (value): Linear(in_features=4, out_features=2, bias=False)
            (dropout): Dropout(p=0.2, inplace=False)
          )
        )
        (projection): Linear(in_features=4, out_features=4, bias=True)
        (dropout): Dropout(p=0.2, inplace=False)
      )
      (feed_forward): FeedForward(
        (ffnet): Sequential(
          (0): Linear(in_features=4, out_features=16, bias=True)
          (1): ReLU(inplace=True)
          (2): Linear(in_features=16, out_features=4, bias=True)
          (3): Dropout(p=0.2, inplace=False)
        )
      )
      (layer_norm1): LayerNorm((4,), eps=1e-05, elementwise_affine=True)
      (layer_norm2): LayerNorm((4,), eps=1e-05, elementwise_affine=True)
    )
    (1): TransformerBlock(
      (self_attn): MultiHeadAttention(
        (heads): ModuleList(
          (0-1): 2 x Head(
            (key): Linear(in_features=4, out_features=2, bias=False)
            (query): Linear(in_features=4, out_features=2, bias=False)
            (value): Linear(in_features=4, out_features=2, bias=False)
            (dropout): Dropout(p=0.2, inplace=False)
          )
        )
        (projection): Linear(in_features=4, out_features=4, bias=True)
        (dropout): Dropout(p=0.2, inplace=False)
      )
      (feed_forward): FeedForward(
        (ffnet): Sequential(
          (0): Linear(in_features=4, out_features=16, bias=True)
          (1): ReLU(inplace=True)
          (2): Linear(in_features=16, out_features=4, bias=True)
          (3): Dropout(p=0.2, inplace=False)
        )
      )
      (layer_norm1): LayerNorm((4,), eps=1e-05, elementwise_affine=True)
      (layer_norm2): LayerNorm((4,), eps=1e-05, elementwise_affine=True)
    )
  )
  (layer_norm): LayerNorm((4,), eps=1e-05, elementwise_affine=True)
  (output): Linear(in_features=4, out_features=1024, bias=True)
)

 

Now, after applying LoRA, notice the key differences:

 

PEFTModel(
    (embeddings): Embedding(1024, 4)
    (position_embeddings): Embedding(16, 4)
    (transformer_blocks): Sequential(
      (0): TransformerBlock(
        (self_attn): MultiHeadAttention(
          (heads): ModuleList(
            (0-1): 2 x Head(
              (key): Linear(in_features=4, out_features=2, bias=False)
              (query): Linear(in_features=4, out_features=2, bias=False)
              (value): Linear(in_features=4, out_features=2, bias=False)
              (dropout): Dropout(p=0.2, inplace=False)
            )
          )
          (projection): Linear(in_features=4, out_features=4, bias=True)
          (dropout): Dropout(p=0.2, inplace=False)
        )
        (feed_forward): FeedForward(
          (ffnet): Sequential(
            (0): Linear(in_features=4, out_features=16, bias=True)
            (1): ReLU(inplace=True)
            (2): Linear(in_features=16, out_features=4, bias=True)
            (3): Dropout(p=0.2, inplace=False)
          )
        )
        (layer_norm1): LayerNorm((4,), eps=1e-05, elementwise_affine=True)
        (layer_norm2): LayerNorm((4,), eps=1e-05, elementwise_affine=True)
      )
      (1): TransformerBlock(
        (self_attn): MultiHeadAttention(
          (heads): ModuleList(
            (0-1): 2 x Head(
              (key): Linear(in_features=4, out_features=2, bias=False)
              (query): Linear(in_features=4, out_features=2, bias=False)
              (value): Linear(in_features=4, out_features=2, bias=False)
              (dropout): Dropout(p=0.2, inplace=False)
            )
          )
          (projection): Linear(in_features=4, out_features=4, bias=True)
          (dropout): Dropout(p=0.2, inplace=False)
        )
        (feed_forward): FeedForward(
          (ffnet): Sequential(
            (0): Linear(in_features=4, out_features=16, bias=True)
            (1): ReLU(inplace=True)
            (2): Linear(in_features=16, out_features=4, bias=True)
            (3): Dropout(p=0.2, inplace=False)
          )
        )
        (layer_norm1): LayerNorm((4,), eps=1e-05, elementwise_affine=True)
        (layer_norm2): LayerNorm((4,), eps=1e-05, elementwise_affine=True)
      )
    )
    (layer_norm): LayerNorm((4,), eps=1e-05, elementwise_affine=True)
    (output): Linear(in_features=4, out_features=1024, bias=True)
  )
  (lora_attn): MultiHeadAttention(
    (heads): ModuleList(
      (0-1): 2 x Head(
        (key): LinearWithLoRA(
          (linear): Linear(in_features=4, out_features=2, bias=False)
          (lora): LoRALayer()
        )
        (query): LinearWithLoRA(
          (linear): Linear(in_features=4, out_features=2, bias=False)
          (lora): LoRALayer()
        )
        (value): LinearWithLoRA(
          (linear): Linear(in_features=4, out_features=2, bias=False)
          (lora): LoRALayer()
        )
        (dropout): Dropout(p=0.2, inplace=False)
      )
    )
    (projection): Linear(in_features=4, out_features=4, bias=True)
    (dropout): Dropout(p=0.2, inplace=False)
    (o_LoRA): LinearWithLoRA(
      (linear): Linear(in_features=4, out_features=4, bias=False)
      (lora): LoRALayer()
    )
  )
  (lora_adaptation): LinearWithLoRA(
    (linear): Linear(in_features=4, out_features=4, bias=False)
    (lora): LoRALayer()
  )
)

 

  • LoRA Integration:
    • PEFTModel integrates LoRA into the attention and feedforward layers, whereas ShakespeareanLanguageModel uses standard layers.
  • Parameter Efficiency:
    • By freezing non-LoRA layers, PEFTModel significantly reduces the number of trainable parameters, making fine-tuning more efficient.
  • Layer Composition:
    • The PEFTModel's attention and feedforward mechanisms are enhanced with LinearWithLoRA, providing a low-rank approximation that maintains performance while optimizing resource usage.
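
To make the low-rank idea concrete, here is a minimal pure-Python sketch of a LoRA-style forward pass for a single 4-to-2 projection, the same shape as the key, query and value layers printed above. It follows the formulation from the LoRA paper, where the frozen output W x is combined with a low-rank update scaled by alpha / rank. The helper names (matvec, lora_forward) and the values rank = 1 and alpha = 2 are assumptions chosen for illustration, and the LoRALayer implemented earlier in this blog may apply the scaling differently.

```python
import random

def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector (list of floats)."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]

def lora_forward(W, A, B, x, alpha, rank):
    """Frozen projection plus scaled low-rank update: W x + (alpha / rank) * B (A x)."""
    base = matvec(W, x)                # frozen pre-trained weights
    update = matvec(B, matvec(A, x))   # low-rank path, only A and B would be trained
    scale = alpha / rank
    return [b + scale * u for b, u in zip(base, update)]

random.seed(0)
in_features, out_features = 4, 2       # same shape as the key/query/value layers above
rank, alpha = 1, 2                     # hypothetical values, for illustration only

# W stands in for the frozen pre-trained weight matrix (out_features x in_features)
W = [[random.gauss(0, 1) for _ in range(in_features)] for _ in range(out_features)]
# A (rank x in_features) starts random, B (out_features x rank) starts at zero
A = [[random.gauss(0, 1) for _ in range(in_features)] for _ in range(rank)]
B = [[0.0] * rank for _ in range(out_features)]

x = [1.0, 2.0, 3.0, 4.0]
base_output = matvec(W, x)
lora_output = lora_forward(W, A, B, x, alpha, rank)

frozen_weights = in_features * out_features               # weights in W
trainable_weights = rank * (in_features + out_features)   # weights in A and B

print("base output:", base_output)
print("LoRA output:", lora_output)
print("frozen:", frozen_weights, "trainable:", trainable_weights)
```

Because B starts at zero, the adapted layer initially reproduces the frozen layer exactly, so fine-tuning starts from the pre-trained behaviour and only moves away from it as A and B are updated.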

Note that the hyperparameters used here, such as context_length set to 16 and embedding_dim set to 4, were kept deliberately small so the example runs quickly. We'll set more realistic values soon!

We've reached the end of our LoRA implementation from scratch. Now, it's time to deploy it, right?

Workflow Templates Definition for Fine-Tuning

We've transitioned from the shakespeare-model-chkp template to the new shakespeare-model-tuner.

  • Executables Update: The Shakespeare-language-model-trainer-checkpointer is now the Shakespeare-language-model-tuner. This update reflects the focus on tuning rather than initial training.
  • Dataset and Model Description: The dataset description has been refined to specify the "Instruct Dataset Modern vs Shakespeare," highlighting the unique characteristics of the dataset we're working with. The model is now labeled as a "Fine Tuned Language Model" to indicate the refined nature of our target.
  • Setup and Training Logs: The setup and training logs have been updated to reflect the fine-tuning process. We'll now have detailed "Fine-Tuning Logs" to track the specific adjustments made during this phase.

 

apiVersion: argoproj.io/v1alpha1
kind: WorkflowTemplate
metadata:
  name: "shakespeare-model-tuner"
  annotations:
    scenarios.ai.sap.com/name: "shakespeare-language-model"
    scenarios.ai.sap.com/description: "Shakespeare Language Model"
    executables.ai.sap.com/name: "Shakespeare-language-model-tuner"
    executables.ai.sap.com/description: "Shakespeare Language Model Tuner Executable"
    artifacts.ai.sap.com/data.kind: "dataset"
    artifacts.ai.sap.com/data.description: "Instruct Dataset Modern vs Shakespeare" 
    artifacts.ai.sap.com/model.kind: "model"
    artifacts.ai.sap.com/model.description: "Fine Tuned Language Model"
    artifacts.ai.sap.com/model.labels: | 
        {"ext.ai.sap.com/step":"fine-tune", "ext.ai.sap.com/version":"0.0.1"}         
    artifacts.ai.sap.com/bpe_model.kind: "model"
    artifacts.ai.sap.com/bpe_model.description: "Byte-Pair Encoding Tokenizer"    
    artifacts.ai.sap.com/bpe_model.labels: | 
        {"ext.ai.sap.com/step":"fine-tune", "ext.ai.sap.com/version":"0.0.1"}     
    artifacts.ai.sap.com/setuplogs.kind: "other"
    artifacts.ai.sap.com/setuplogs.description: "Fine-Tuning Logs"
    artifacts.ai.sap.com/setuplogs.labels: | 
        {"ext.ai.sap.com/step":"setup", "ext.ai.sap.com/version":"0.0.1"}
    artifacts.ai.sap.com/logs.kind: "other"
    artifacts.ai.sap.com/logs.description: "Model Training Logs"
    artifacts.ai.sap.com/logs.labels: | 
        {"ext.ai.sap.com/step":"fine-tune", "ext.ai.sap.com/version":"0.0.1"}    
  labels:
    scenarios.ai.sap.com/id: "shakespeare-language-model"
    executables.ai.sap.com/id: "shakespeare-tuner"
    ai.sap.com/version: "0.0.1"

 

  • Additional Parameters: We've introduced two new parameters, LORA_RANK and LORA_ALPHA, which control the LoRA layers attached to the attention mechanism of our model. Both are made available as environment variables for step 2:

 

      - name: LORA_RANK
        description: Rank of the LoRA (Low-Rank Adaptation) matrices. It sets the size of the low-rank update added to the adapted layers and therefore the number of trainable parameters.
      - name: LORA_ALPHA
        description: Scaling factor applied to the LoRA (Low-Rank Adaptation) update. It regulates how strongly the low-rank update influences the output of the adapted layers.

      - name: LORA_RANK
        value: "{{workflow.parameters.LORA_RANK}}"
      - name: LORA_ALPHA
        value: "{{workflow.parameters.LORA_ALPHA}}"

 

In our fine-tuning process, we are strategically choosing which parameters to adjust and which to keep fixed. This approach allows us to leverage the benefits of the pre-trained model while adapting it to new tasks with minimal changes.

For the parameters retained from pre-training, we are using a context length of 256 tokens. This is tied to the architecture of the pre-trained model and changing it would disrupt the structure of the model. The embedding dimension is set to 384, which is a core architectural parameter influencing the entire network. Altering it would necessitate reinitializing the embedding layer and related components, negating the benefits of pre-training. We have 6 attention heads, an important parameter for the model's attention mechanism, and any change would require a comprehensive rework of the attention layers. The model depth is defined by 6 layers, and changing the number of layers would involve adding or removing entire layers, which isn't compatible with the pre-trained architecture. Lastly, the vocabulary size is fixed at 65 + 3 + 6 (65 base characters + 3 new special characters for fine tuning + 6 special characters during pre-training), ensuring that the embeddings and subsequent layers correctly map the input tokens. Changing the dictionary size would require reinitializing the embedding layer.

For the parameters selected for fine-tuning, the batch size can be adjusted based on computational resources and training stability. The number of iterations to run the fine-tuning process is defined by the iteration limit. The evaluation frequency determines how often to evaluate the model during fine-tuning, while the evaluation steps specify the number of steps to perform during each evaluation phase. The learning rate, typically set lower than the pre-training learning rate, is crucial for fine-tuning to ensure stable and gradual updates. The dropout rate can be fine-tuned to prevent overfitting during the training process.

 

# Import necessary libraries
import os
import torch

class TrainingParameters:
    def __init__(self):
        # PARAMETERS AVAILABLE FOR UNPICKLE PRE-TRAINED MODEL
        # context_length is tied to the architecture of the pre-trained model
        #self.context_length = int(os.environ.get('CONTEXT_LENGTH'))
        self.context_length = 256
        # The embedding dimension and attention heads are core architectural 
        # parameters. Changing them would require reinitializing the embedding 
        # layer and other dependent components, which would negate the benefits 
        # of pre-training.
        #self.embedding_dim = int(os.environ.get('EMBEDDING_DIM'))
        self.embedding_dim = 384
        #self.attention_heads = int(os.environ.get('ATTENTION_HEADS'))
        self.attention_heads = 6
        # The number of layers in the model defines its depth. Changing this 
        # would require adding or removing layers, which is not compatible 
        # with the pre-trained model architecture 
        #self.num_layers = int(os.environ.get('NUM_LAYERS'))
        self.num_layers = 6
        # The size of the dictionary (vocabulary size) must remain the same
        # to ensure that the embeddings and subsequent layers correctly map 
        # the input tokens. Changing this would require reinitializing the 
        # embedding layer.
        #self.dictionary_size = int(os.environ.get('DICTIONARY_SIZE'))
        self.dictionary_size = 65 + 3 + 6 # 3 new special chars from fine tuning 
                                          # and 6 from training
                
        #________________________________           
        # SELECTABLE PARAMETERS FOR FINE TUNING
        # The batch size can be adjusted during fine-tuning based on your 
        # computational resources and the stability of training
        self.batch_size = int(os.environ.get('BATCH_SIZE'))
        self.iteration_limit = int(os.environ.get('ITERATION_LIMIT'))
        self.eval_frequency = int(os.environ.get('EVAL_FREQUENCY'))
        self.eval_steps = int(os.environ.get('EVAL_STEPS'))
        # The learning rate for fine-tuning should generally be lower than 
        # the learning rate used during pre-training
        self.learning_rate = float(os.environ.get('LEARNING_RATE'))
        self.dropout = float(os.environ.get('DROPOUT'))
        self.device = 'cuda' if torch.cuda.is_available() else 'cpu'

        self.DATA_PATH = '/app/data/shakespeare-style-transfer.txt'
        self.MODEL_PATH = '/app/model/'
        self.TOKENIZER_MODEL_PATH = '/app/tokenizer/'
        self.LOG_PATH = '/app/logs/'
        self.LOG_NAME = 'fine_tuning_logs.log'
        self.INPUT_MODEL = '/app/input_model/'
        self.INPUT_TOKENIZER_MODEL = '/app/input_tokenizer/'

class LoraParameters:
    def __init__(self):
        self.r = int(os.environ.get('LORA_RANK'))
        self.alpha = int(os.environ.get('LORA_ALPHA'))
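
One practical note on the classes above: int(os.environ.get(...)) raises a rather cryptic TypeError when a variable is missing from the configuration, because int() receives None. The sketch below shows one possible way to fail with a clearer message. The helper read_env is hypothetical and not part of the original code, and the values assigned to LORA_RANK and LORA_ALPHA are set here only so the example runs on its own.

```python
import os

def read_env(name, cast, default=None):
    """Read an environment variable and cast it, failing with a clear message."""
    raw = os.environ.get(name)
    if raw is None:
        if default is None:
            raise KeyError(f"Required environment variable {name} is not set")
        return default
    try:
        return cast(raw)
    except ValueError as err:
        raise ValueError(
            f"Environment variable {name}={raw!r} is not a valid {cast.__name__}"
        ) from err

# Example values only; in the workflow these come from the Configuration
os.environ["LORA_RANK"] = "8"
os.environ["LORA_ALPHA"] = "16"

lora_rank = read_env("LORA_RANK", int)
lora_alpha = read_env("LORA_ALPHA", int)
print(lora_rank, lora_alpha)
```

With a helper like this, a misconfigured execution names the offending variable in the logs instead of stopping with a generic type error.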

 

In addition, the workflow structure remains consistent, focusing on setup and training pipelines as before, but tailored to the requirements of our fine-tuning objectives. These changes are pretty simple, right? Of course, you can find the full file in the GitHub repository.

 

Deploy fine-tuning workflow using SAP AI Core

Alright, we've finished coding! Now, it's time to deploy. If you've been following along, you should have everything you need. If not, take a quick look at the previous blogs. We'll wait for you. Ready? Let's dive in!

Begin by copying or downloading the code for the ai-core-fine-tuner and ai-core-fine-tuner-setup from the GitHub repository. Once you've done that, you'll end up with a folder that looks something like this:

16.png

The code fetches the raw instruct dataset file from GitHub (just as it did with tinyshakespeare.txt for training), but you can also use a local copy if you've already pulled it from GitHub (ai-core-datasets). Feel free to choose whichever method works best for you!

As usual, let's set up a Configuration with the following parameters:

12.png

 In addition to our usual parameters, we have introduced two new ones:

13.png

In accordance with the paper "LoRA: Low-Rank Adaptation of Large Language Models", when selecting the optimal rank (r) for LoRA, it's recommended to start with smaller values and experiment to strike the right balance between model performance and efficiency for your specific task and dataset. Experimentation is key!
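
To get a feeling for what the rank costs, the following sketch counts how many trainable weights a LoRA update adds to a single 384 x 384 projection (384 being the embedding dimension of our pre-trained model) for a few ranks. It assumes the usual LoRA layout of two matrices with shapes rank x in_features and out_features x rank. The function names and the list of ranks are assumptions for illustration, not values taken from our configuration.

```python
def full_params(in_features, out_features):
    """Weights in a dense projection (bias ignored)."""
    return in_features * out_features

def lora_params(in_features, out_features, rank):
    """Weights in the two low-rank matrices A (rank x in) and B (out x rank)."""
    return rank * (in_features + out_features)

embedding_dim = 384  # embedding dimension of the pre-trained model
total = full_params(embedding_dim, embedding_dim)

for rank in (1, 2, 4, 8, 16):  # hypothetical candidate ranks
    trainable = lora_params(embedding_dim, embedding_dim, rank)
    print(f"rank={rank:>2}  LoRA weights={trainable:>6}  "
          f"share of full matrix={trainable / total:.2%}")
```

For example, rank 8 adds 6,144 trainable weights next to a frozen matrix of 147,456 weights, which is roughly 4% of the original size. The count grows linearly with the rank, which is why starting small and increasing only when the validation loss asks for it is a sensible strategy.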

Finally, map the artifacts (or the paths in S3), similar to what we did for the checkpointer:

14.png

Yes, the "data" path remains the same as before because we're using it to save the instruct dataset (and I admit, it's also a bit of convenience! 😊). However, feel free to create a different artifact if necessary.

Before we kick things off, take a moment to make sure those Docker images (the ones mentioned in finetune_template.yaml) are up on Docker Hub (or wherever your image registry is).

15.png

If they're there, we're all set to run it!

Experimenting and Evaluating results for Fine Tuning model in SAP AI Launchpad

Now that we've run it, the "Process Overview" tab gives us a complete view of the workflow, as you already know.

16.png

Analyzing and Checking Artifacts

Once the workflow is finished, we'll have the fine-tuned model generated (style-transfer-model.pkl). Let's verify it in the S3 path:

21.png

Similarly, the "Metric Resources" tab empowers us to track intermediate losses (for both training and validation), examine the total number of parameters, and identify which parameters are trainable. This functionality directly stems from the LoRA implementation, enhancing our ability to analyze and optimize model performance.

17.png

Looking at the results, it appears that our model is overfitting - it's performing well on the training data but not generalizing effectively to unseen validation data. Not to worry, we can optimize the model's performance by tweaking the hyperparameters. We'll apply regularization techniques and experiment with LoRA scaling ratios. Let's create a new configuration named 'shakespeare-tuner-exp1'.

Given that we're not introducing new information or complex concepts to the model but rather teaching it to apply its learned language style to new data (modern English sentences), we'll use a low-rank approach. We'll adjust our initial strategy from alpha = (2 x rank) to alpha = rank and observe the impact. Additionally, we'll increase the dropout rate from 0.2 to 0.3, reduce the number of layers and attention heads from 6 to 4, and also decrease the batch size and context length to 32 and 128, respectively. Let's see how these changes affect the model performance:

18.png

Notably, the validation loss now takes longer to start increasing, but it still does, which indicates that the overfitting persists.
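
If you export the intermediate losses, this kind of check can also be done programmatically. The sketch below finds the evaluation step with the lowest validation loss and the gap between training and validation loss at each evaluation. The loss values are made up purely for illustration and are not the results of our runs, and the helper names are hypothetical.

```python
def best_checkpoint(eval_steps, val_losses):
    """Return the (step, loss) pair where the validation loss is lowest."""
    idx = min(range(len(val_losses)), key=val_losses.__getitem__)
    return eval_steps[idx], val_losses[idx]

def generalization_gap(train_losses, val_losses):
    """Validation loss minus training loss at each evaluation."""
    return [v - t for t, v in zip(train_losses, val_losses)]

# Made-up numbers for illustration only, not taken from the experiments above
steps = [0, 500, 1000, 1500, 2000]
train = [2.10, 1.40, 1.05, 0.80, 0.62]
val = [2.12, 1.55, 1.38, 1.45, 1.60]

step, loss = best_checkpoint(steps, val)
gaps = generalization_gap(train, val)

print(f"lowest validation loss {loss} at step {step}")
print("gap per evaluation:", [round(g, 2) for g in gaps])
```

A validation loss that bottoms out early while the gap keeps widening is the typical overfitting pattern described above.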

Let's proceed to adjust other parameters, such as reducing the learning rate from 3e-4 (0.0003) to facilitate smaller steps, fostering a slower and more stable convergence. We'll implement these changes in a new configuration named “shakespeare-tuner-exp2”.

22.png

As we can clearly see, modifying and experimenting with hyperparameters can lead to significant changes in how the model runs and performs. Through a few experiments, we managed to achieve a better balance between training and validation losses, reduce overfitting, and unfortunately, increase execution time (what a shame! ☹️)… but that's how it goes, doesn't it?

23.png

While we're 'guessing' at the best hyperparameters here, in practice, systematic methods like Hyperparameter Tuning and Neural Architecture Search (NAS) are incredibly valuable for defining the optimal model architecture and hyperparameters. You might want to explore these methods to fine-tune your model effectively.

Moreover, assessing the computational requirements of your model is important before deployment. You can use various techniques including benchmarking on sample data, monitoring resource usage with system and profiling tools, analyzing model size and batch sizes, considering distributed training options, and exploring cloud-based experiments for scalability and optimization. Before moving a model to production, evaluating its computational demands involves analyzing its algorithmic time and space complexities using techniques like Big O notation and assessing the number of floating-point operations (FLOPs) executed during inference or training to estimate workload and efficiency.
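
As a back-of-the-envelope example of such an estimate, the sketch below counts the weights in the linear layers of a model with our pre-trained dimensions (embedding dimension 384, 6 layers, context length 256, dictionary size 74) and derives a rough forward-pass FLOP count from them. It rests on simplifying assumptions: roughly 2 FLOPs (one multiply, one add) per weight per token, a feed-forward expansion factor of 4 as in the model printout, and no accounting for biases, layer norms, embeddings, the attention score computation or the LoRA layers. Treat the result as an order of magnitude, not a measurement.

```python
def linear_weights(in_features, out_features):
    """Weights in a dense layer (bias ignored)."""
    return in_features * out_features

def estimate_block_weights(embedding_dim, ffn_multiplier=4):
    """Linear-layer weights in one transformer block."""
    # key, query and value across all heads, plus the output projection
    attention = 4 * linear_weights(embedding_dim, embedding_dim)
    # two feed-forward layers: expand to ffn_multiplier * dim, then project back
    feed_forward = 2 * linear_weights(embedding_dim, ffn_multiplier * embedding_dim)
    return attention + feed_forward

def estimate_forward_flops(embedding_dim, num_layers, context_length, dictionary_size):
    """Rough forward-pass FLOPs for one sequence: 2 FLOPs per weight per token."""
    weights = (num_layers * estimate_block_weights(embedding_dim)
               + linear_weights(embedding_dim, dictionary_size))  # output layer
    return 2 * weights * context_length

block = estimate_block_weights(384)
flops = estimate_forward_flops(embedding_dim=384, num_layers=6,
                               context_length=256, dictionary_size=65 + 3 + 6)

print(f"weights per block: {block:,}")
print(f"approx. forward FLOPs per sequence: {flops:,}")
```

Multiplying the per-sequence figure by the batch size and the number of iterations gives a first idea of how a configuration change, such as a different batch size, shifts the workload before you actually run it.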

Well, the name "fine-tuning" has never been so clear, has it? 😊

By the way, after some experiments, we achieved great performance with our final fine-tuned model. Just wait for it!

Leveraging Logs for Efficient Debugging in AI Deployments

One aspect we haven't discussed much is the importance of logs. Implementing logging mechanisms is highly beneficial for various reasons, particularly for root cause analysis in execution and deployment scenarios (whether local or in the cloud). Developing a robust logging strategy can yield numerous advantages...

You can leverage logs for various purposes such as Performance Monitoring, Model Monitoring and Drift Detection, Security and Compliance Auditing, Resource Utilization and Cost Optimization, User Behavior Analysis, Predictive Maintenance, and more.

In our context, let's briefly explore a fraction of what logging can offer, specifically focusing on troubleshooting and debugging. Logs play a critical role in identifying and resolving issues within AI systems by capturing detailed information about system behavior, errors, warnings, and exceptions encountered during execution. This information is invaluable for diagnosing the root cause of failures or unexpected behaviors.
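
As a minimal illustration of this idea, the sketch below uses Python's standard logging module to write both regular progress messages and a full traceback to a log file. The function build_logger, the logger name and the message formats are assumptions for illustration and not the logging code used in our pipeline. A temporary directory is used so the example runs anywhere, whereas in the workflow the target would be the log path defined in TrainingParameters.

```python
import logging
import os
import tempfile

def build_logger(log_dir, log_name, logger_name="fine_tuning"):
    """Create a logger that writes timestamped messages to a file."""
    os.makedirs(log_dir, exist_ok=True)
    logger = logging.getLogger(logger_name)
    logger.setLevel(logging.INFO)
    logger.propagate = False  # keep messages out of the root logger
    if not logger.handlers:   # avoid duplicate handlers on repeated calls
        handler = logging.FileHandler(os.path.join(log_dir, log_name))
        handler.setFormatter(
            logging.Formatter("%(asctime)s %(levelname)s %(message)s"))
        logger.addHandler(handler)
    return logger

log_dir = tempfile.mkdtemp()  # stand-in for the real log directory
logger = build_logger(log_dir, "fine_tuning_logs.log")

# Regular progress message (the numbers are placeholders)
logger.info("step=%d train_loss=%.4f val_loss=%.4f", 500, 1.40, 1.55)

# Errors are logged together with their traceback for root cause analysis
try:
    int("not-a-number")
except ValueError:
    logger.exception("Could not parse a hyperparameter")

with open(os.path.join(log_dir, "fine_tuning_logs.log")) as log_file:
    log_text = log_file.read()
print(log_text)
```

Using logger.exception inside the except block records the stack trace next to the message, which is usually the first thing you need when an execution ends in a failed state.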

19.png

 20.png

Here we go! We've successfully implemented LoRA in our fine-tuned model for Shakespeare-style transfer. Now, let's proceed to the next step: serving the models.

I hope you're as excited about this as I am! 😉

 

Wrapping Up and Next Steps

Congratulations on implementing LoRA from scratch! In this blog, we've covered the essential steps to get your Shakespearean Language Model fine-tuned and ready for deployment using SAP AI Core and SAP AI Launchpad.

Let's recap what we've covered:

  • Understanding Fine-Tuning: We explored the concept of fine-tuning and its importance in machine learning.
  • Defining the Fine-Tuning Task: We set up a task for language style transfer to Shakespearean English.
  • Curating the Dataset: We discussed the importance of a well-structured dataset and provided examples of modern English sentences paired with their Shakespearean equivalents.
  • Parameter-Efficient Fine-Tuning (PEFT): We examined how PEFT makes fine-tuning large language models more efficient.
  • Low-Rank Adaptation (LoRA): We delved into LoRA and its benefits for optimizing the fine-tuning process.
  • Implementing LoRA: We provided a step-by-step guide to implementing LoRA in your models using PyTorch.
  • Deploying the Fine-Tuning Workflow: We learned how to deploy your fine-tuning workflow using SAP AI Core.

Next Steps

Now that we've laid the groundwork for a successful fine-tuning deployment with LoRA, stay tuned for the upcoming blogs in this series, where we'll explore further steps and functionalities:

 

Further References