Authors: Salma Sohrabi-Jahromi, Mathis Börner
In this blog post, we'll explore the fine-tuning of Large Language Models (LLMs), explain its purpose and methodology, and provide SAP’s perspective. This blog post is part of a series that dives into various aspects of SAP’s approach to generative AI, and its technical underpinnings. In the previous blog post of this series, we discussed how SAP is leveraging generative AI advancements to build the SAP Foundation Model, a table-native AI model for predictive tasks based on tabular data. Also read the first blog post of the series.
The breakthrough of LLMs is largely based on their impressive zero-shot capabilities. This means that the model can follow complex instructions and perform tasks without an example or additional training phase. These impressive skills come from the pre-training of the models. While pre-trained language models possess extensive general knowledge, deploying them directly in business settings often leads to challenges like generating inaccurate information (hallucinations), relying on outdated facts, and the inability to integrate proprietary corporate data. To enhance their accuracy and efficacy for business applications, it's essential to implement model customization strategies:
Figure 1: Model customization can be achieved in several ways. For applications, it is important to find a balance between the cost of implementation and the customization required for the use case. Often, a combination of methods will yield the best results.
Implementing strategies like prompt engineering, multi-shot prompting, and retrieval-augmented generation (RAG) can significantly enhance the accuracy and reliability of generative AI applications and enable quick adoption of many generative AI use scenarios. Read our previous blog post on benchmarking to learn more. In-context learning is also the appropriate strategy for providing the model with up-to-date, user-specific, or internal data, since this data can be retrieved, or access to it can be authorized, at the time of the request. Especially for tasks where the model is supposed to use or provide factual knowledge, it is always advisable to provide external data as part of the input so that the model can ground its answer in this data. With the advent of LLMs featuring million-token context windows, research has shown that these models, when employed with in-context learning using a high number of examples (>>100), can match the performance of fine-tuned models. This highlights the importance of first establishing a baseline with in-context learning for comparison with any fine-tuned model. SAP AI Launchpad provides the ideal environment for prompt engineering and in-context learning experiments. It lets you run a prompt against many different models and compare the outputs without writing a single line of code.
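To make the idea of multi-shot prompting concrete, here is a minimal sketch of how worked examples can be placed in front of a new input. The function name, the "Input:/Output:" layout, and the sample requests are illustrative assumptions, not a format prescribed by SAP AI Launchpad or any particular model.

```python
from typing import List, Tuple


def build_few_shot_prompt(instruction: str,
                          examples: List[Tuple[str, str]],
                          query: str) -> str:
    """Assemble an instruction, worked examples, and the new input into one prompt."""
    parts = [instruction.strip(), ""]
    for source, target in examples:
        parts.append(f"Input: {source}")
        parts.append(f"Output: {target}")
        parts.append("")
    # The last block is left open so that the model completes the output.
    parts.append(f"Input: {query}")
    parts.append("Output:")
    return "\n".join(parts)


# Hypothetical examples; a real baseline might use many more of them.
examples = [
    ("Invoice 4711 is overdue by 30 days.", "payment_reminder"),
    ("Please send the delivery note for order 1234.", "document_request"),
]
prompt = build_few_shot_prompt(
    "Classify the request into one category.",
    examples,
    "Where is the delivery note for order 9876?",
)
print(prompt)
```

The same builder can be reused for the in-context baseline: the number of examples grows, while the model itself stays unchanged.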
However, in-context learning requires that new model expectations be easily verbalized. In some scenarios, this is extremely difficult or even impossible, especially when not only new knowledge but also new skills or specific behaviors are expected. For example, it is impractical to teach an LLM a new language or a completely new domain through in-context learning alone. Similarly, if a task requires a very specific behavior or response structure, instructing the model only via prompting may not result in a sufficiently reliable response. This may be especially problematic if the LLM response is consumed by software rather than by a human who can interpret or correct it.
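The following sketch illustrates why response structure matters when software consumes the output: the consuming code has to reject anything that is not exactly in the expected shape. The function name, the JSON format, and the field names are assumptions chosen for illustration.

```python
import json
from typing import Dict, Iterable


def parse_structured_response(raw: str,
                              required_fields: Iterable[str]) -> Dict[str, object]:
    """Parse a model response as a JSON object and check that required fields exist."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError(f"response is not valid JSON: {exc}") from exc
    if not isinstance(data, dict):
        raise ValueError("response must be a JSON object")
    missing = [name for name in required_fields if name not in data]
    if missing:
        raise ValueError(f"response is missing fields: {missing}")
    return data


fields = ["invoice_number", "total"]

# A well-formed response passes through unchanged.
good = '{"invoice_number": "4711", "total": "120.50"}'
print(parse_structured_response(good, fields))

# A chatty response that a human could read still breaks the consuming software.
bad = "Sure! The invoice number is 4711 and the total is 120.50."
try:
    parse_structured_response(bad, fields)
except ValueError as err:
    print("rejected:", err)
```

Counting how often such a check rejects responses is one simple way to measure whether prompting alone is reliable enough for a given task.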
In such cases, fine-tuning can significantly improve the models or apply the necessary adjustments to them. Fine-tuning can be essential to achieve top-tier performance in specific business domains and where LLMs are embedded in a business application to take over complex logic. Top-tier performance can mean improving the quality of LLM responses, but fine-tuning can also be used to make models significantly more reliable and consistent in the way they perform their task. Fine-tuning can also make it possible to use significantly smaller models, potentially reducing costs and response times. Additionally, by using smaller task-specific models instead of larger general-purpose LLMs, fine-tuning contributes to sustainability by reducing the carbon emissions associated with operating LLMs. Finally, fine-tuning makes it possible to leverage ongoing user feedback, facilitating continuous improvement over time and thus steadily enhancing the user experience through iterative updates.
Thus, there are clear situations where fine-tuning turns out to be a highly useful technique. In the following sections, we will discuss specific examples where SAP is delivering new capabilities based on fine-tuned LLMs. Developing these capabilities has taught us a few lessons about fine-tuning and has highlighted typical and recurring challenges. The second chapter discusses these challenges and how SAP solutions can help overcome them.
At SAP, we are committed to developing generative AI applications that embody three fundamental principles: relevant, reliable, and responsible. Read this blog post to learn more. These principles, explained again below, ensure that our AI solutions are not only tailored to the specific needs of businesses but also robust and ethically sound.
To deliver tailored, high-quality solutions that adhere to these principles, we fine-tune LLMs to achieve results that not only enhance the capabilities of our AI solutions but also ensure they adhere to our security and compliance values. Here are two such examples:
Building generative AI capabilities for ABAP:
With the advancements in LLMs and the development of coding assistants for programming languages such as Java or Python, we aim to provide similar efficiency gains to ABAP developers as well. However, integrating LLMs out of the box poses challenges: only a limited amount of relevant training data is publicly available, which results in an inferior developer experience compared to other widely used programming languages. Teaching a model to write high-quality and syntactically correct code in a new language is not possible through in-context learning alone. SAP can provide AI-powered tools for its developer community by leveraging the following resources:
With this vision, we are leveraging fine-tuning to develop generative models with an enhanced understanding of ABAP code and development objects. This strategic initiative is developed in partnership with NVIDIA, and will not only help to boost developer efficiency, but also foster the highest standards of data security and user privacy.
Figure 2: Example code completion using our solution compared with the GPT prediction
Developing custom LLMs for document processing:
SAP is actively involved in intelligent document processing to address the prevalent challenge of managing business documents that are still processed manually in many sectors. These could be documents such as invoices in accounts payable or delivery notes in logistics. We have added Document Information Extraction, premium edition to our AI services. It utilizes LLMs to automate the extraction of unstructured data, transforming it into structured formats for seamless integration into business workflows. The premium edition is based on generative AI, which not only enables extraction with customizable schemas but also supports a wide range of document types and extends language capabilities to over 40 languages, thereby accelerating various business operations and reducing time-to-value. Read this blog post to see the roadmap and outlook of Document Information Extraction, premium edition.
To achieve the high extraction accuracy that is essential for fully automating manual workflows, we recognize that in-context learning and prompt engineering alone might not always lead to the precision required for complex document processing tasks. To address this, SAP is exploring fine-tuning techniques to adapt LLMs so that they are especially suited for document information extraction. To develop such a solution, we are leveraging an internal pool of anonymized business documents, both to obtain high fidelity for the most common business document types used by our customers and to maintain the flexibility to expand to new use cases. As you can see in Figure 3, SAP’s fine-tuning technique can boost the extraction results. A much smaller base model (with 7 billion trainable parameters) can exceed the accuracy of a much larger general-purpose model while retaining the capacity to generalize to unseen document types. If some examples of the specific document type are included in the training data, the accuracy is enhanced further. Overall, fine-tuning would not only boost the model’s information extraction quality but would also unlock continuous improvements and customizations based on user feedback.
Figure 3: Benchmark results for Document Information Extraction on a subset of header and line item fields
Before jumping into a fine-tuning project, there are a few aspects to be aware of.
Training data
The most challenging aspect of fine-tuning projects is often the training data. To grasp the data requirements, it is essential to understand the three stages of LLM training: pre-training, instruction training, and alignment training. In the pre-training stage, the model is trained on a large corpus of text through a self-supervised next-token prediction task. This stage is followed by supervised fine-tuning using instruction-output pairs to refine the model's adherence to specific prompts. The final stage often involves further fine-tuning using, for example, Reinforcement Learning from Human Feedback (RLHF) to align the model's responses more closely with human preferences. For LLM fine-tuning, the focus is usually on the instruction and alignment training stages. The catch with those two stages is that they need labeled data. For most use cases, such data is not available and must be collected or generated, which involves significant time and cost. If the quantity and quality of the data are not sufficient, training is likely to fail, because low data quality can’t be fixed in the later steps. Research has also given hints that for instruction training and alignment training, data quality and variety may be more important than the sheer volume of data. SAP systems or data solutions like SAP Datasphere can be the basis for getting the data needed for fine-tuning from SAP applications. This data can enhance LLM-based applications when used as additional context for in-context learning and, of course, also for fine-tuning.
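As a rough illustration of what instruction-output pairs can look like in practice, the sketch below cleans a small list of pairs and serializes them as JSON Lines. The field names "instruction" and "output", the cleaning rules, and the sample records are assumptions for this example; the expected schema depends on the training framework in use.

```python
import json
from typing import Dict, List


def clean_pairs(pairs: List[Dict[str, str]]) -> List[Dict[str, str]]:
    """Drop empty and exact-duplicate instruction-output pairs, keeping the order."""
    seen = set()
    cleaned = []
    for pair in pairs:
        instruction = pair.get("instruction", "").strip()
        output = pair.get("output", "").strip()
        if not instruction or not output:
            continue  # an unlabeled or empty record cannot be used for supervision
        key = (instruction, output)
        if key in seen:
            continue  # exact duplicates add volume but no variety
        seen.add(key)
        cleaned.append({"instruction": instruction, "output": output})
    return cleaned


def to_jsonl(pairs: List[Dict[str, str]]) -> str:
    """Serialize one JSON object per line."""
    return "\n".join(json.dumps(pair, ensure_ascii=False) for pair in pairs)


# Hypothetical raw records, including one duplicate and one unlabeled entry.
raw_pairs = [
    {"instruction": "Extract the invoice number: 'Invoice no. 4711'", "output": "4711"},
    {"instruction": "Extract the invoice number: 'Invoice no. 4711'", "output": "4711"},
    {"instruction": "Extract the total: 'Total due: 120.50 EUR'", "output": ""},
    {"instruction": "Extract the currency: 'Total due: 120.50 EUR'", "output": "EUR"},
]
training_pairs = clean_pairs(raw_pairs)
print(to_jsonl(training_pairs))
```

Simple filters like these only catch obvious problems; judging whether the labels are correct and varied enough usually still requires domain knowledge.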
Evaluation data and benchmarking process
There are many benchmarks for LLMs that try to capture the general capabilities of a model, but there isn't even a clear definition of what general model capabilities mean, so a reliable measure of them is not available at the moment (Clémentine Fourrier of Hugging Face wrote a nice blog post about the state of LLM benchmarking). Generic benchmarks should therefore not be used blindly to predict task- or domain-specific performance. See our earlier blog post in this series to learn more about how SAP approaches LLM benchmarking. Evaluation data can be a hold-out subset of the training data, but next-token prediction accuracy, which is used during training, is not a suitable measure of task-specific performance. So, in addition to defining an evaluation data set, an evaluation procedure must be defined that sufficiently captures the performance as experienced in the application. Without a proper evaluation procedure and data, it is impossible to judge the improvements achieved by fine-tuning and steer the project in the right direction. We are exploring ways to make it easier for developers to collect data and user feedback for applications built using the generative AI hub in SAP AI Core.
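A minimal sketch of such an evaluation procedure, assuming a document extraction task: part of the labeled data is held out, and the score is the share of fields extracted exactly right, which is closer to what the application experiences than next-token accuracy. The function names, the 20% hold-out fraction, and the exact-match criterion are illustrative assumptions.

```python
import random
from typing import Dict, List, Sequence, Tuple


def split_holdout(records: Sequence[dict],
                  holdout_fraction: float = 0.2,
                  seed: int = 0) -> Tuple[List[dict], List[dict]]:
    """Shuffle reproducibly and return (training records, hold-out records)."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    n_holdout = max(1, int(len(shuffled) * holdout_fraction))
    return shuffled[n_holdout:], shuffled[:n_holdout]


def field_accuracy(predictions: List[Dict[str, str]],
                   references: List[Dict[str, str]]) -> float:
    """Share of reference fields whose predicted value matches exactly."""
    correct = 0
    total = 0
    for predicted, reference in zip(predictions, references):
        for field, expected in reference.items():
            total += 1
            if predicted.get(field, "").strip() == expected.strip():
                correct += 1
    return correct / total if total else 0.0


# Hypothetical labeled documents and model outputs.
references = [
    {"invoice_number": "4711", "total": "120.50"},
    {"invoice_number": "4712", "total": "80.00"},
]
predictions = [
    {"invoice_number": "4711", "total": "120.50"},
    {"invoice_number": "4712", "total": "8.00"},  # one wrong field
]
accuracy = field_accuracy(predictions, references)
print(f"field accuracy: {accuracy:.2f}")

train, holdout = split_holdout([{"id": i} for i in range(10)])
print(len(train), "training records,", len(holdout), "hold-out records")
```

Running the same procedure on the in-context baseline and on each fine-tuned candidate gives comparable numbers for steering the project.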
Collecting user feedback in AI applications can be critical for generating training data for model fine-tuning, but most importantly, a user feedback system can provide direct insights into the improvements achieved by a model update. Using LLMs as judges to test a model before deploying it in the application has emerged as a new, cost-effective evaluation method. We are working to make these approaches available to our users through the SAP AI Launchpad. To learn more about using LLMs as judges, see the blog post on benchmarking LLMs mentioned above.
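The LLM-as-judge idea can be sketched as a pairwise comparison between a baseline and a candidate model. Everything here is an assumption for illustration: the judge is a pluggable function, the toy judge below simply prefers the longer answer instead of calling a real LLM, and asking twice with swapped positions is one common way to reduce position bias rather than a method described in this post.

```python
from typing import Callable, List

# A judge receives (prompt, answer_a, answer_b) and returns "A" or "B".
Judge = Callable[[str, str, str], str]


def pairwise_win_rate(prompts: List[str],
                      baseline_answers: List[str],
                      candidate_answers: List[str],
                      judge: Judge) -> float:
    """Fraction of comparisons won by the candidate, averaged over both orderings."""
    wins = 0.0
    for prompt, base, cand in zip(prompts, baseline_answers, candidate_answers):
        first = judge(prompt, cand, base)   # candidate shown in position A
        second = judge(prompt, base, cand)  # candidate shown in position B
        wins += 0.5 * (first == "A") + 0.5 * (second == "B")
    return wins / len(prompts) if prompts else 0.0


def toy_judge(prompt: str, answer_a: str, answer_b: str) -> str:
    """Placeholder judge: prefers the longer answer. A real judge would call an LLM."""
    return "A" if len(answer_a) >= len(answer_b) else "B"


prompts = ["Summarize the invoice.", "Summarize the delivery note."]
baseline_answers = ["ok", "A detailed and complete answer."]
candidate_answers = ["A longer reply.", "short"]

win_rate = pairwise_win_rate(prompts, baseline_answers, candidate_answers, toy_judge)
print(f"candidate win rate: {win_rate:.2f}")
```

A judge-based score like this is best treated as a pre-deployment signal that complements, rather than replaces, feedback from real users.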
Cost and infrastructure
Both collecting data and defining an evaluation process cost time and money. Another major cost factor is infrastructure. The infrastructure needed depends on three factors: the amount of training data, the model size, and the chosen training method. In terms of model size for fine-tuning, the general guideline is to choose the smallest model capable of accomplishing the task. This helps keep costs within budget and improves throughput and latency when used in the application. There are two different flavors of training methods: full fine-tuning, which updates all model weights, and parameter-efficient fine-tuning (PEFT), for example LoRA, which trains only a small number of additional parameters.
The degree to which the model can be adapted is limited when using PEFT compared to full fine-tuning. But as a trade-off, strong results can be achieved with significantly less data and compute power, as this Cornell University article suggests. A rule of thumb is that both the data and compute requirements are about two orders of magnitude lower with LoRA than with full fine-tuning. Whether PEFT is sufficient to boost the fidelity of an LLM depends on the use case. But recent results, published in this report, show that in many scenarios LoRA fine-tuning can boost small models to or above GPT-4 level. While training costs can be very high, inference costs can be even more important. It is estimated that 70-90% of the resources used in AI are used for inference, as noted by this report by Schneider Electric and this report by Facebook AI.
As a result, if the running costs of a model are too high, there is little chance that fine-tuning will pay off. When it comes to the cost of running LLMs, the focus is often solely on the cost of infrastructure, and other costs are overlooked. These include further infrastructure costs such as networking, autoscaling, and logging, but most importantly the cost of development, integration, and operations staff. For an overview of the cost of generative AI models in production, refer to this blog by Hugging Face’s Phil Schmid. This is where SAP AI Core and its generative AI hub capabilities as a service in SAP Business Technology Platform can be of great value. See this blog post introducing the generative AI hub. It is designed to handle the execution and operations of your AI assets in a standardized, scalable, and hyperscaler-agnostic way. SAP AI Core provides seamless integration with your SAP solutions. Any AI function can be easily realized using open-source frameworks. SAP AI Core supports full lifecycle management of AI scenarios. To further improve our offering, we are exploring how we can make the deployment of fine-tuned models more cost-effective by sharing the infrastructure, and thus the costs, across many models without compromising on reliability and security. But even today, open-source LLMs and fine-tuned variants of them can be deployed on SAP AI Core. A comprehensive blog series on how to do so can be found here.
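To give a feeling for why LoRA is so much lighter, the arithmetic below compares trainable parameters for a single weight matrix. LoRA keeps the original matrix frozen and trains two low-rank matrices of shapes (d_out x r) and (r x d_in) instead. The dimensions and the rank used here are illustrative assumptions, and the result covers trainable parameters of one matrix only; it is not a direct measure of the data and compute savings mentioned above.

```python
def full_finetuning_params(d_in: int, d_out: int) -> int:
    """Trainable parameters when the whole weight matrix is updated."""
    return d_in * d_out


def lora_params(d_in: int, d_out: int, rank: int) -> int:
    """Trainable parameters of the two low-rank matrices B (d_out x r) and A (r x d_in)."""
    return rank * (d_in + d_out)


# Illustrative values: a square 4096 x 4096 projection and rank 8.
d_in = d_out = 4096
rank = 8

full = full_finetuning_params(d_in, d_out)  # 16,777,216
lora = lora_params(d_in, d_out, rank)       # 65,536
reduction = full / lora                     # 256.0

print(f"full fine-tuning: {full:,} trainable parameters")
print(f"LoRA (rank {rank}): {lora:,} trainable parameters")
print(f"reduction factor: {reduction:.0f}x")
```

With these assumed values, the trainable parameter count for this matrix drops by a factor of 256, which is in the same range as the two orders of magnitude quoted as a rule of thumb.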
Expertise and compliance
In addition to the technical aspects mentioned above, there are other issues that are critical to a successful project. Technical proficiency is indispensable. Without available expertise, projects face a steep learning curve. Additionally, fine-tuning and deploying LLMs involves greater responsibility than simply using pre-built models through platforms like the generative AI hub in SAP AI Core. Compliance is also a significant concern; even open-source models with permissive licenses may have restrictions, particularly regarding fine-tuning. Furthermore, when selecting training data, compliance with legal standards is essential, since all model users indirectly access the content of the training data. This means that it is not possible to control user-specific access to the knowledge contained in the data. Therefore, we need to carefully select training data and consider the legal implications. We are evaluating how we can support our customers by providing end-to-end fine-tuning as a service, based on SAP and other data. Furthermore, a pre-selected list of available models would help prevent compliance issues.
While building LLM-based applications with in-context learning is very accessible to organizations without extensive AI experience, fine-tuning projects can be a different beast. And there is one clear message: validate your use cases via in-context learning before starting with fine-tuning. All the hurdles discussed here will be familiar to people who have worked on data-driven projects in the past, as they resemble many of the challenges known independently of generative AI and LLMs. This blog on ScholarSpace is a good resource for understanding why certain data-driven projects do not succeed. This is not to discourage you from starting your own fine-tuning projects, but rather to encourage you to plan and evaluate carefully before embarking on a project, and to strategically seek out partners who can fill potential gaps.
At SAP, we understand the complexities and potential of fine-tuning LLMs for various business scenarios. While exploring fine-tuning to deliver the best user experience on our end, we are also dedicated to exploring ways to support you with fine-tuning, ensuring that you can achieve deep augmentation and adjustments of models tailored to your specific business needs and data. Leveraging today's capabilities of SAP AI Core can already provide significant support in managing the computational demands of fine-tuning projects and deploying fine-tuned models in production. And we are committed to helping you navigate these challenges and improving our offering to unlock the full potential of your fine-tuning initiatives. Stay tuned for updates on our developments in this direction!
Co-authored by: Akhil Agarwal, Henning Heitkötter, Philipp Herzig, and Christiane Kubach