Authors: Salma Sohrabi-Jahromi, Mathis Börner

In this blog post, we'll explore the fine-tuning of Large Language Models (LLMs), explain its purpose and methodology, and provide SAP’s perspective. This blog post is part of a series that dives into various aspects of SAP’s approach to generative AI and its technical underpinnings. In the previous blog post of this series, we discussed how SAP is leveraging generative AI advancements to build the SAP Foundation Model, a table-native AI model for predictive tasks based on tabular data. Also read the first blog post of the series.

Business applications need specialized solutions

The breakthrough of LLMs is largely based on their impressive zero-shot capabilities. This means that the model can follow complex instructions and perform tasks without an example or additional training phase. These impressive skills come from the pre-training of the models. While pre-trained language models possess extensive general knowledge, deploying them directly in business settings often leads to challenges like generating inaccurate information (hallucinations), relying on outdated facts, and the inability to integrate proprietary corporate data. To enhance their accuracy and efficacy for business applications, it's essential to implement model customization strategies:


Figure 1: Model customization can be achieved in several ways. For applications, it is important to find a balance between the cost of implementation and the customization required for the use case. Often, a combination of methods will yield the best results.

  • In-context learning: This approach leaves the model untouched but introduces additional context into the prompts dynamically from external databases. This helps anchor the model, providing more precise and dependable responses. We can utilize databases of relevant Q&A pairs as examples in the prompt (one/multi-shot prompting) or incorporate informative text snippets (Retrieval-Augmented Generation or RAG); see the sketch after this list. Read this blog at Nvidia.com for a starter on RAG.
  • Fine-tuning: Fine-tuning is the process of adjusting the pre-trained language model's parameters to better suit specific tasks or domains. This exposure to targeted examples during fine-tuning allows the model to grasp subtle details specific to the domain, effectively bridging the gap between a generalist model and one tailored for specialized tasks. This is key to unlocking the full potential of LLMs in distinct domains or applications.
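
To make in-context learning concrete, here is a minimal sketch in Python of how retrieved snippets and few-shot examples can be assembled into a prompt. The helper name, the example pairs, and the retrieval step are illustrative placeholders, not a specific SAP API:

# Minimal sketch of in-context learning: few-shot examples and retrieved
# snippets are placed into the prompt; the model itself is left untouched.
FEW_SHOT_EXAMPLES = [
    {"question": "What does payment term NET30 mean?",
     "answer": "Payment is due in full 30 days after the invoice date."},
]

def build_prompt(user_question: str, retrieved_snippets: list[str]) -> str:
    parts = ["Answer the question using only the context below."]
    parts += [f"Context: {s}" for s in retrieved_snippets]   # RAG context
    for ex in FEW_SHOT_EXAMPLES:                             # one/multi-shot examples
        parts.append(f"Q: {ex['question']}\nA: {ex['answer']}")
    parts.append(f"Q: {user_question}\nA:")
    return "\n\n".join(parts)

The resulting string is sent to the model as a regular prompt; grounding data and examples travel with every request instead of being baked into the model.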

How to choose between the two?

Implementing strategies like prompt engineering, multi-shot prompting, and RAG can significantly enhance the accuracy and reliability of generative AI applications and enable quick adoption of many generative AI use scenarios. Read our previous blog post on benchmarking to learn more. In-context learning is also the appropriate strategy for providing the model with up-to-date, user-specific, or internal data, since this data can be retrieved, and access to it authorized, at the time of the request. Especially for tasks where the model is supposed to use or provide factual knowledge, it is always advisable to provide external data as part of the input so that the model can ground its answer in this data. With the advent of LLMs featuring million-token context windows, research has shown that these models, when employed with in-context learning using a high number of examples (>>100), can match the performance of fine-tuned models. This highlights the importance of first establishing a baseline with in-context learning for comparison with any fine-tuned model. SAP AI Launchpad provides the ideal environment for prompt engineering and in-context learning experiments. It allows you to run a prompt against many different models and compare the outputs without writing a single line of code.

However, in-context learning requires that the expected model behavior can be easily verbalized. In some scenarios, this is extremely difficult or even impossible, especially when not only new knowledge but also new skills or specific behaviors are expected. For example, it is impractical to teach an LLM a new language or a completely new domain through in-context learning alone. Similarly, if a task requires a very specific behavior or response structure, instructing the model only via prompting may not result in a sufficiently reliable response. This is especially problematic if the LLM response is consumed by software rather than by a human who can interpret or correct it.
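
To illustrate why software-consumed responses are so sensitive to reliability, consider this hedged Python sketch; the function and field names are invented for illustration:

import json

def parse_order_update(llm_response: str) -> dict:
    # Software consuming the response cannot interpret or correct it the
    # way a human reader could; any format deviation breaks the pipeline.
    try:
        payload = json.loads(llm_response)
    except json.JSONDecodeError as err:
        # A prompt-only model may intermittently wrap the JSON in prose;
        # fine-tuning on the exact response format reduces such failures.
        raise ValueError(f"Unparseable model output: {err}") from err
    missing = {"order_id", "status"} - payload.keys()
    if missing:
        raise ValueError(f"Missing required fields: {missing}")
    return payload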

In such cases, fine-tuning can bring significant improvements or apply the necessary adjustments to the models. Fine-tuning can be essential for achieving top-tier performance in specific business domains and where LLMs are embedded in a business application to take over complex logic. Top-tier performance can mean improving the quality of LLM responses, but fine-tuning can also make models significantly more reliable and consistent in the way they perform their task. Fine-tuning can also be used to significantly reduce model size, potentially reducing costs and response times. Additionally, by using smaller task-specific models instead of larger general-purpose LLMs, fine-tuning contributes to sustainability by reducing the carbon emissions associated with operating LLMs. Finally, fine-tuning allows us to leverage ongoing user feedback, facilitating continuous improvement and perpetually enhancing the user experience through iterative updates.

Thus, there are obvious situations where fine-tuning turns out to be a highly useful technique. In the following sections, we will discuss specific examples where SAP is delivering new capabilities based on fine-tuned LLMs. Developing these capabilities has taught us a few lessons about fine-tuning and has highlighted typical and recurring challenges. The second part of this post discusses these challenges and how SAP solutions can help overcome them.

LLMs Fine-tuned by SAP

At SAP, we are committed to developing generative AI applications that embody three fundamental principles: relevant, reliable, and responsible. Read this blog post to learn more. These principles, explained below, ensure that our AI solutions are not only tailored to the specific needs of businesses but also robust and ethically sound.

  • Relevant: Our AI solutions are intricately designed with the business context in mind. By embedding AI into our leading enterprise applications and technology solutions, we ensure that SAP systems are equipped with the most pertinent AI capabilities, making them future-proof and highly relevant to specific business needs.
  • Reliable: Our AI is developed based on specific industry and business process data, and SAP's deep understanding of these processes ensures that our AI solutions are dependable and seamlessly integrate into daily business operations.
  • Responsible: We have a longstanding commitment to delivering AI solutions that uphold the highest standards of security, privacy, compliance, and ethics.

To deliver tailored, high-quality solutions that follow these principles, we fine-tune LLMs to achieve results that not only enhance the capabilities of our AI solutions but also ensure they adhere to our security and compliance values. Here are two such examples:

Building generative AI capabilities for ABAP:

With the advancements in LLMs and the development of coding assistants for programming languages such as Java or Python, we aim to provide similar efficiency gains to ABAP developers as well. However, integrating LLMs out-of-the-box poses challenges: there is a limited amount of relevant training data publicly available, which results in an inferior developer experience compared to other widely used programming languages. Teaching a model to write high-quality and syntactically correct code in a new language is not possible through in-context learning alone. SAP can provide AI-powered tools for its developer community by leveraging the following resources:

  1. Access to an extensive internal codebase: SAP has access to hundreds of millions of lines of internal source code that is not publicly available and can serve as ideal training data.
  2. Collaboration with expert ABAP community: With the help of the ABAP expert community at SAP, we can build high quality annotated datasets for various scenarios and use cases such as code generation and explanation, to further adapt our models to specific tasks.
  3. Integration within ABAP Development Tools (ADT): By integrating these models in ADT, we plan to not only make them seamless to use, but also increase the quality of their responses by taking the ADT user context into account when answering a question or generating code.
  4. Continuous learning and improvement: By leveraging feedback from our ABAP community and adapting to evolving coding standards, we can ensure the fine-tuned models remain accurate and relevant over time.

With this vision, we are leveraging fine-tuning to develop generative models with an enhanced understanding of ABAP code and development objects. This strategic initiative, developed in partnership with NVIDIA, will not only help boost developer efficiency but also foster the highest standards of data security and user privacy.


Figure 2: Example code completion using our solution compared with the GPT prediction

Developing custom LLMs for document processing:

SAP is actively involved in intelligent document processing to address the prevalent challenge of business documents that are still processed manually in many sectors. These could be documents such as invoices in accounts payable or delivery notes in logistics. We have added Document Information Extraction, premium edition to our AI services. It utilizes LLMs to automate the extraction of unstructured data, transforming it into structured formats for seamless integration into business workflows. The premium edition is based on generative AI, which not only enables extraction with customizable schemas but also supports a wide range of document types and extends language capabilities to over 40 languages, thereby accelerating various business operations and reducing time-to-value. Read this blog post for the roadmap and outlook of Document Information Extraction, premium edition.
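
Purely to illustrate what extraction with a customizable schema means conceptually, here is a sketch of a schema as a plain Python structure. The field names and layout are invented for this example and are not the actual interface of Document Information Extraction, premium edition:

# Illustrative extraction schema: which header and line-item fields should
# be returned for a given document type.
invoice_schema = {
    "documentType": "invoice",
    "headerFields": [
        {"name": "invoiceNumber", "type": "string"},
        {"name": "invoiceDate",   "type": "date"},
        {"name": "grossAmount",   "type": "number"},
    ],
    "lineItemFields": [
        {"name": "description", "type": "string"},
        {"name": "quantity",    "type": "number"},
        {"name": "unitPrice",   "type": "number"},
    ],
}

The extraction then returns values in the same structured form, ready for integration into a business workflow.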

To achieve the high extraction accuracy essential for fully automating manual workflows, we recognize that in-context learning and prompt engineering alone might not always deliver the precision required for complex document processing tasks. To address this, SAP is exploring fine-tuning techniques to adapt LLMs so that they are especially suited for document information extraction. To develop such a solution, we are leveraging an internal pool of anonymized business documents to both obtain high fidelity for the most common business document types used by our customers and maintain the flexibility to expand to new use cases. As you can see in Figure 3, SAP’s fine-tuning technique can boost the extraction results: a much smaller base model (with 7 billion trainable parameters) can exceed the accuracy of a much larger general-purpose model while retaining the capacity to generalize to unseen document types. If some examples of the specific document type are included in the training data, the accuracy is enhanced further. Overall, fine-tuning would not only boost the model’s information extraction quality but would also unlock continuous improvements and customizations based on user feedback.


Figure 3: Benchmark results for Document Information Extraction on a subset of header and line item fields

What to consider for your own fine-tuning projects?

Before jumping into a fine-tuning project, there are a few aspects to be aware of from the outset.

Training data

The most challenging aspect of fine-tuning projects is often the training data. To grasp the data requirements, it is essential to understand the three stages of LLM training: pre-training, instruction training, and alignment training. In the pre-training stage, the model is trained on a large corpus of text through a self-supervised next-token prediction task. This stage is followed by supervised fine-tuning using instruction-output pairs to refine the model's adherence to specific prompts. The final stage often involves further fine-tuning using, for example, Reinforcement Learning from Human Feedback (RLHF) to align the model's responses more closely with human preferences.

For LLM fine-tuning, the focus is usually on the instruction and alignment training stages. The catch with those two stages is that they need labeled data; see the sketch below for what such data typically looks like. For most use cases, such data is not available and must be collected or generated, involving significant time and cost. If the quantity and quality of the data are not sufficient, training is likely to fail, because low data quality can’t be fixed in later steps. Research also suggests that for instruction training and alignment training, data quality and variety may be more important than sheer volume. SAP systems or data solutions like SAP Datasphere can be the basis for getting the data needed for fine-tuning from SAP applications. This data can enhance LLM-based applications when used for in-context learning as additional context and, of course, also for fine-tuning.
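
As a hedged illustration of what labeled instruction-training data looks like, here is a minimal Python sketch that writes instruction-output pairs in the JSONL layout many training frameworks accept; the field names and examples are illustrative:

import json

# Each record pairs an instruction (and optional input) with the desired output.
samples = [
    {"instruction": "Classify the urgency of this support ticket.",
     "input": "Our production system has been down since 6 am.",
     "output": "high"},
    {"instruction": "Summarize the delivery note in one sentence.",
     "input": "Delivery 4711: 12 pallets of spare parts, received 2024-05-02.",
     "output": "Twelve pallets of spare parts for delivery 4711 arrived on 2024-05-02."},
]

with open("train.jsonl", "w") as f:
    for sample in samples:
        f.write(json.dumps(sample) + "\n")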

Evaluation data and benchmarking process

There are many benchmarks for LLMs that try to capture the general capabilities of a model, but there isn't even a clear definition of what general model capabilities mean, so a reliable measure is not available at the moment (Clémentine Fourrier of Hugging Face wrote a nice blog post about the state of LLM benchmarking). Generic benchmarks should therefore not be used blindly to predict task- or domain-specific performance – see our earlier blog post in this series to learn more about how SAP approaches LLM benchmarking. Evaluation data can be a hold-out subset of the training data, but the next-token prediction accuracy used during training is not a suitable measure of task-specific performance. So, in addition to defining an evaluation data set, an evaluation procedure must be defined that sufficiently captures the performance as experienced in the application; a minimal example follows below. Without a proper evaluation procedure and data, it is impossible to judge the improvements achieved by fine-tuning and steer the project in the right direction. We are exploring ways to help developers collect data and user feedback for applications built using the generative AI hub in SAP AI Core.
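
As a minimal example of such a procedure, a task-specific metric can be computed over a held-out set; exact match is shown here for simplicity, though many tasks need softer measures (F1, semantic similarity, or human review):

def exact_match_accuracy(predictions: list[str], references: list[str]) -> float:
    # Fraction of model outputs that match the reference answer exactly
    # (case- and whitespace-insensitive).
    assert len(predictions) == len(references)
    hits = sum(p.strip().lower() == r.strip().lower()
               for p, r in zip(predictions, references))
    return hits / len(references)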

Collecting user feedback in AI applications can be critical for generating training data for model fine-tuning, but most importantly, a user feedback system can provide direct insights into the improvements achieved by a model update. Using LLMs as judges to test a model before deploying it in the application has emerged as a new, cost-effective evaluation method. We are working to make these approaches available to our users through the SAP AI Launchpad. To learn more about using LLMs as judges, see the blog post on benchmarking LLMs mentioned above.
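
Here is a minimal sketch of the LLM-as-a-judge idea, assuming any callable that sends a prompt to a strong judge model and returns its text; the rubric and scale are illustrative:

JUDGE_PROMPT = """You are an impartial judge. Rate the candidate answer
against the reference answer on a scale from 1 (wrong) to 5 (equivalent).
Question: {question}
Reference answer: {reference}
Candidate answer: {candidate}
Respond with the number only."""

def judge_score(llm_call, question: str, reference: str, candidate: str) -> int:
    # llm_call(prompt) -> str is any client for the judge model.
    response = llm_call(JUDGE_PROMPT.format(
        question=question, reference=reference, candidate=candidate))
    return int(response.strip())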

Cost and infrastructure

Both collecting data and defining an evaluation process cost time and money. Another major cost factor is infrastructure. The infrastructure needed depends on three factors: the amount of training data, the model size, and the chosen training method. In terms of model size for fine-tuning, the general guideline is to choose the smallest model capable of accomplishing the task. This helps keep costs within budget and improves throughput and latency when used in the application. There are two different flavors of training methods:

  • Full fine-tuning: This method considers every model parameter as trainable and seeks to update each according to the new dataset. While often yielding superior task-specific performance (if provided with sufficient data), it demands significant training time, costs, and infrastructure.
  • Parameter-efficient fine-tuning (PEFT): Here, only a limited number of model parameters are fine-tuned, with the majority of the pretrained LLM's parameters kept frozen. This considerably reduces both computational and storage overheads. Typical examples are Low-Rank Adaptation (LoRA), prefix tuning, and prompt tuning; see the sketch after this list.
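
As a sketch of how PEFT looks in practice, here is LoRA applied with the open-source Hugging Face peft library; the base model name, rank, and target modules are illustrative choices, not a recommendation:

from transformers import AutoModelForCausalLM
from peft import LoraConfig, TaskType, get_peft_model

model = AutoModelForCausalLM.from_pretrained("mistralai/Mistral-7B-v0.1")

lora_config = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    r=8,                  # rank of the low-rank update matrices
    lora_alpha=16,        # scaling factor for the update
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],  # adapt attention projections only
)

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # typically well below 1% of all parameters

The frozen base weights are shared, and only the small adapter matrices are trained and stored, which is what makes PEFT so much cheaper than full fine-tuning.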

The degree to which the model can be adapted is limited when using PEFT compared to full fine-tuning. But as a trade-off, strong results can be achieved with significantly less data and compute power, as this Cornell University article suggests. A rule of thumb is that both the data and compute requirements are about two orders of magnitude lower with LoRA than with full fine-tuning. Whether PEFT is sufficient to boost the fidelity of an LLM depends on the use case. But recent results, published in this report, show that in many scenarios LoRA fine-tuning can boost small models to or above GPT-4 level. While training costs can be very high, inference costs can be even more important: it is estimated that 70-90% of the resources used in AI are used for inference, as noted by this report by Schneider Electric or this report by Facebook AI.
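
A back-of-the-envelope calculation makes the order-of-magnitude difference tangible (the layer count, hidden size, and rank are illustrative values for a 7B-parameter model):

# Full fine-tuning updates every parameter of a 7B model.
full_ft_params = 7_000_000_000

# LoRA with rank 8 on two projection matrices per layer adds one pair of
# low-rank matrices (hidden x r and r x hidden) per adapted matrix.
layers, hidden, rank, matrices_per_layer = 32, 4096, 8, 2
lora_params = layers * matrices_per_layer * (2 * hidden * rank)

print(f"LoRA trainable parameters: {lora_params:,}")               # 4,194,304
print(f"Share of full model: {lora_params / full_ft_params:.4%}")  # ~0.06%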

As a result, if the running costs of a model are too high, there is little chance that fine-tuning will pay off. When it comes to the cost of running LLMs, the focus is often solely on the cost of compute, while other costs are overlooked. These include additional infrastructure costs such as networking, autoscaling, and logging, but most importantly the cost of development, integration, and operations staff. For an overview of the cost of generative AI models in production, refer to this blog by Hugging Face’s Phil Schmid.

This is where SAP AI Core and its generative AI hub capabilities, offered as a service in SAP Business Technology Platform, can provide great value. See this blog post introducing the generative AI hub. It is designed to handle the execution and operations of your AI assets in a standardized, scalable, and hyperscaler-agnostic way. SAP AI Core provides seamless integration with your SAP solutions, any AI function can be easily realized using open-source frameworks, and it supports full lifecycle management of AI scenarios. To further improve our offering, we are exploring how we can make the deployment of fine-tuned models more cost-effective by sharing the infrastructure, and thus the costs, across many models without compromising on reliability and security. But even today, open-source LLMs and fine-tuned variants of them can be deployed on SAP AI Core. A comprehensive blog series on how to do so can be found here.

Expertise and compliance

In addition to the technical aspects mentioned above, there are other issues critical to a successful project. Technical proficiency is indispensable; without available expertise, projects face a steep learning curve. Additionally, fine-tuning and deploying LLMs involves greater responsibility than simply using pre-built models through platforms like the generative AI hub in SAP AI Core. Compliance is also a significant concern: even open-source models with permissive licenses may have restrictions, particularly regarding fine-tuning. Furthermore, when selecting training data, compliance with legal standards is essential, since all model users indirectly access the training data's content. This means that it is not possible to control user-specific access to the knowledge contained in the data. Therefore, training data needs to be selected carefully and with its legal implications in mind. We are evaluating how we can support our customers by providing end-to-end fine-tuning as a service, based on SAP and other data. Furthermore, a pre-selected list of available models would help prevent compliance issues.

Conclusion

While building LLM-based applications with in-context learning is very accessible to organizations without extensive AI experience, fine-tuning projects can be a different beast. And there is one clear message: validate your use cases via in-context learning before starting with fine-tuning. All the hurdles discussed here will be familiar to people who have worked on data-driven projects in the past, as they largely mirror challenges that exist independently of generative AI and LLMs. This blog on ScholarSpace is a good resource for understanding why certain data-driven projects do not succeed. This is not to discourage you from starting your own fine-tuning projects, but rather to encourage you to plan and evaluate carefully before embarking on a project, and to strategically seek out partners who can fill potential gaps.

At SAP, we understand the complexities and potential of fine-tuning LLMs for various business scenarios. While exploring fine-tuning to deliver the best user experience on our end, we are also dedicated to supporting you with fine-tuning, ensuring that you can achieve deep augmentation and adjustment of models tailored to your specific business needs and data. Leveraging today's capabilities of SAP AI Core can already provide significant support in managing the computational demands of fine-tuning projects and deploying fine-tuned models in production. And we are committed to helping you navigate these challenges and improving our offering to unlock the full potential of your fine-tuning initiatives. Stay tuned for updates on our developments in this direction!

Co-authored by: Akhil Agarwal, Henning Heitkötter, Philipp Herzig, and Christiane Kubach

6 Comments
mhaas123
Advisor

Great article, thanks a lot! I really appreciate the links to relevant reading material as well 😀

chrishoertnagl
Advisor

Amazing post, awesome explanations 😃

Jelena_Perfiljeva
Active Contributor

Nice article.

The SAP solution for ABAP code is incorrect though. I'm in an S/4HANA 2022 system and the IF_FIN_RE_DATA_CACHE method CONTAINS doesn't have ER_DATA in the signature. Which makes sense because, based on the method's name, I'd expect it to return just a true/false value. There is another method (GET), shown in the screenshot, which I'd expect to provide the data.

It's possible that whatever system this screenshot is from could have a different method signature (and ChatGPT didn't get a memo on this). But again, based on the name, I can't think of any reason to do that.

Not to mention if I'm correct about the method's signature (it just needs to check whether a record exists), then I'd add TRANSPORTING NO FIELDS for better performance. And SAP added all that cryptic new syntax like VALUE etc. and now your AI is not using it and uses READ TABLE? The air is thick with irony here. 🙂

Bottom line, I wouldn't rush to declare a victory just yet.

[Screenshot of the IF_FIN_RE_DATA_CACHE method signatures attached]

 

MaximilienFourmy
Explorer

Thank you for the article, it is very interesting and promising!

The coding experience with AI code editors (like Cursor with the model Claude 3.5 Sonnet) is really great when working with JavaScript or Python.
If we can have a similar experience with ABAP, it will have a huge impact on development speed and efficiency.

For information, I checked the S/4HANA 2023 version and indeed, there is this new parameter ER_DATA for the method CONTAINS.

[Screenshot of the CONTAINS method signature in S/4HANA 2023 attached]

I asked Claude 3.5 Sonnet to apply best practices to the code and the result looks really great.
System Prompt:
You are an expert in ABAP Clean Code and you refactor the provided code using the latest best practices
User Prompt:

READ TABLE mt_entries WITH KEY key = iv_key
REFERENCE INTO DATA(lr_entry).
IF sy-subrc = 0.
  ev_found = abap_true.
  er_data = lr_entry->data.
ELSE.
  ev_found = abap_false.
ENDIF.


Model response:
Here's a refactored version of the code using ABAP Clean Code principles:

METHOD read_entry.
  TRY.
      DATA(lr_entry) = REF #( mt_entries[ key = iv_key ] ).
      ev_found = abap_true.
      er_data  = lr_entry->data.
    CATCH cx_sy_itab_line_not_found.
      ev_found = abap_false.
      CLEAR er_data.
  ENDTRY.
ENDMETHOD.

Explanation of the changes:
1. Used a method structure with meaningful name `read_entry` to encapsulate the logic.
2. Replaced the `READ TABLE` statement with the more modern and concise table expression `mt_entries[ key = iv_key ]`.
3. Utilized the `TRY-CATCH` block to handle the case when the entry is not found, which is cleaner than checking `sy-subrc`.
4. Used inline declaration for `lr_entry` with `REF #( )` to directly get a reference to the table entry.
5. Set `ev_found` to `abap_true` only when the entry is found, and `abap_false` in the `CATCH` block when it's not found.
6. Added `CLEAR er_data` in the `CATCH` block to ensure `er_data` is empty when the entry is not found.
7. Removed the `IF-ELSE` structure, making the code more straightforward and reducing nesting.

This refactored version is more readable, uses modern ABAP syntax, and follows clean code principles by making the intent clearer and reducing the complexity of the code.


Max

LeonardoAraujo
Active Contributor

Very good BLOG.

Pushed on the same button via my BLOG 

Availability of an ABAP-Fine-tuned model is All You Need

🙂

zfiori
Participant

Hi Mathis,

Such a great article, thanks a lot!

I really appreciate the links to relevant reading material as well.

It helps our team a lot.

Appreciate it.

 

Regards,

ZFiori.

 

🙂