Technology Blog Posts by SAP
FelixSasaki

KGBlog.jpg

This blog post is part of a series that dives into various aspects of SAP’s approach to Generative AI, and its technical underpinnings.

In previous blog posts of this series, you learned about how to use large language models (LLMs) for developing AI applications in a trustworthy and reliable manner. We also explained various engineering best practices for generative AI and how foundation models on structured data enable the creation of new AI use cases with less manual effort for downstream tasks.

Read the first blog post of the series.

In generative AI application development or techniques like fine tuning, Knowledge Graphs can play a crucial role. For SAP, knowledge graphs are a key means to leverage our unique structured and unstructured knowledge assets (business data models, business process metadata and documentation) for generative AI scenarios. Knowledge graphs contribute to grounding for LLMs, address challenges like hallucinations, and provide semantics to SAP’s Foundation Model (read blog post).

In this blog post you will learn what knowledge graphs are and why we, at SAP, use them to build generative AI solutions. We will then dive into the first knowledge graph solution developed for one of the generative AI scenarios. We will explain how knowledge graphs are used in concrete, but still experimental use cases and how we advance the state-of-the-art technologies to combine LLMs and knowledge graphs. We will close with our vision on evolving knowledge graphs for all generative AI solutions at SAP.

1. Background on knowledge graphs

Let’s start with a short experiment: use your favorite Web search engine and type “SAP SE”. What do you get as part of the result? In addition to a list of Web pages, you will see a comprehensive overview of SAP, drawn from a knowledge graph. Parts of the content of the knowledge graph are visualized in “Figure 1: A part of a knowledge graph about SAP SE and related entities”.

fsasaki_1-1722525268376.png

Figure 1: A part of a knowledge graph about SAP SE and related entities

The term “knowledge graph” was popularized in 2012, when Google introduced its Knowledge Graph. Today, knowledge graphs are a common part of our daily Web search experience. In our experiment, the search results not only include a list of Web pages containing the string “SAP SE”, but also provide a structured and interconnected network of information about the entity “SAP SE”. This demonstrates a key aspect of knowledge graphs: they encode “things, not strings”, that is, entities, their properties, and relations to other entities.

In the knowledge graph, you will see properties of the entity “SAP SE”, for example its description or a link to the SAP website. You will also see relations to other entities in the knowledge graph, for example the headquarters of SAP. Clicking, for example, on “Walldorf” will lead you to the knowledge graph entry for Walldorf, including other types of properties and relations, like the population “14,646” and the “region” relation to the entity “Karlsruhe”.

You learn another aspect of knowledge graphs from this example: the entities, their properties, and relations are generated from existing heterogeneous data sources. In contrast to other data integration approaches using data warehouses with one unified schema, there is no common data model across these data sources. For various types of entities like companies, persons, or cities, the knowledge graph models are created in a data- and use-case-driven manner. What is said about models in statistics, “all models are wrong, but some are useful”, can also be said about modeling in knowledge graphs. There is no perfect knowledge graph model of an entity like “person” or “company”; there are only models that fit more or less well to given use cases.

A frequent data source used in web-based knowledge graphs is Wikipedia. Depending on the type of entity and its properties or relations needed for a given use case, other data sources come into play. Example use cases we see here are “provide financial information”, like stock prices, and “provide location-based information”, like “current weather”. Implementing these use cases means adding information from multiple, unrelated data sources, like databases that provide the current stock price of SAP SE or the current weather in Karlsruhe.

Knowledge graphs should not be conceived as replacing or encoding all data sources as a graph. Rather, they are carefully designed around use cases, with loosely coupled integration of data sources. Knowledge graphs provide a flexible information access layer, which is extended in a use-case-driven manner and without the need for tight data integration. They cross the boundaries of the underlying, heterogeneous data sources, in our example Wikipedia, stock market information databases, and weather databases.

Knowledge graphs also enable free, “follow your nose” exploration of the network of information. Starting from the entity you searched for, e.g. “SAP SE”, you may discover other entities, properties, and relations you were not aware of: did you know that Walldorf is part of the Karlsruhe region? This “discovery of the unexpected” capability of knowledge graphs opens the door to unimagined use cases and even business opportunities that monetize existing data sources in new ways.

At SAP, a knowledge graph-based solution has been available to customers since 2021 with SAP S/4 Intelligent Situation Handling. Various knowledge graphs are currently being developed by multiple groups at SAP and are being brought into production embedded in several solutions. In the following section, you will learn more about the role of knowledge graphs on business data, why the integration of multiple data sources is needed, why we need flexible information access layers, and why each use case in the context of generative AI has different needs for knowledge graphs.

2. Knowledge graphs on linked business data

We explain the value of knowledge graphs based on SAP S/4HANA.

SAP S/4HANA is an enterprise resource planning software meant to cover all day-to-day processes of an enterprise, ranging from order-to-cash, procure-to-pay, plan-to-product, and request-to-service to many other core capabilities (link). Since the introduction of SAP S/4HANA in 2015, the software has been widely adopted by customers, guided by consultants, developers, and domain experts. There are many different sources where we can find technical documentation, tutorials, and instructions. In such a diverse universe of information, finding a single source of truth in its whole integrity is a challenging task. Business knowledge graphs are developed exactly to address this problem: bringing business data together and modeling it in context and in relation to other objects, essentially forming linked business data.

Let’s now conduct another experiment, or rather formulate a requirement: we want to use data from SAP S/4HANA systems for generative AI. The data resides in tables with names that can only be understood by domain experts, like “EKKO”. Without deep knowledge of SAP data models, neither developers of generative AI applications nor LLMs can make sense of such data.

A knowledge graph can come to the rescue both for the developer and LLMs. “Figure 2: Example knowledge graph about business metadata and underlying data sources” shows a part of such a knowledge graph.

fsasaki_0-1723030700523.png

Figure 2: Example knowledge graph about business metadata and underlying data sources

The knowledge graph again contains entities, properties and relations. Types of entities include ABAP tables like “EKKO”, business objects like “purchase order”, or entity sets from OData APIs. A key value of the knowledge graph is that it stores business objects not as strings but as things, i.e. entities in the graph. Like all entities and relations in a knowledge graph, the business objects have global and unique identifiers and, in that way, enable the flexible integration of new (meta)data sources. Via the “related business object” relation, they constitute an extensible semantic access layer across the domain-specific metadata items. In that way, the knowledge graph implements our requirement and enables semantic access to the table “EKKO”.

Like in the previous example around Web search, several data sources that are not directly connected beforehand come into play. These include the SAP S/4HANA data dictionary, SAP documentation, repositories of business objects, and API definitions. The knowledge graph relates these data sources in a use case driven manner without trying to “boil the ocean”. We enlarge, for example, the knowledge graph iteratively and only add the parts of OData API definitions that are needed for a given use case. The decentralized nature of the knowledge graph enables the “follow your nose” principle. For instance, after you have identified the business object “purchase contract” for SAP database table EKKO (Purchasing Document Header), you can find linked entities like the CDS view “I_PurchaseContract” or the OData entity set “I_PurchaseContract” that you have not been aware of, and that can trigger new use cases.

The knowledge graph generation process encompasses four common steps of knowledge graph life cycles: data extraction, modeling, knowledge graph generation and knowledge graph provision.

  • We extract metadata (for ABAP tables, business objects, CDS views etc.) from various SAP internal and external data sources.
  • We define the target model for our use case. The model captures only the types of entities and their properties that are needed by our use cases. We do not boil the ocean, i.e. we do not reflect all aspects of the source data in the knowledge graph. The modeling approach relies on our knowledge graph technology stack, RDF; see details in the section “RDF as the technical basis for knowledge graphs”.
  • We transform the extracted data to the target model.
  • We upload the data to a knowledge graph database and provide the graph to applications via APIs.

An additional step of indexing parts of the graph in a vector database is part of our use cases; see details in the section “Knowledge graphs and retrieval augmented generation”.
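The four steps above can be sketched in code. The following toy pipeline is a minimal illustration with hard-coded sample data; the function names, the record layout, and the triple prefixes are illustrative assumptions, not SAP’s actual implementation.

```python
# Minimal sketch of the four-step knowledge graph generation pipeline
# described above, using made-up toy data.

def extract_metadata():
    # Step 1: extract metadata from source systems
    # (here: a hard-coded toy sample instead of real extractors).
    return [{"table": "EKKO", "business_object": "PurchaseOrder",
             "description": "Purchasing Document Header"}]

def transform(records):
    # Step 3: map the extracted records to triples of the target model
    # (Step 2, defining the target model, is implicit in the mapping below).
    triples = []
    for r in records:
        subj = f"abaptable:{r['table']}"
        triples.append((subj, "s4:relatedBusinessObject", f"bo:{r['business_object']}"))
        triples.append((subj, "cds:description", r["description"]))
    return triples

# Step 4 would upload `graph` to a knowledge graph database; we just print it.
graph = transform(extract_metadata())
print(graph[0])  # ('abaptable:EKKO', 's4:relatedBusinessObject', 'bo:PurchaseOrder')
```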

In summary, with this knowledge graph on business metadata, we create a semantic access layer consisting of SAP’s business knowledge assets, which grows in a flexible and iterative manner over time, driven by use cases. In this way, we require less domain knowledge from developers and LLMs.

The knowledge graph will grow step-by-step. Each new iteration of extending the knowledge graph can leverage existing knowledge graph modeling. For example, an OData entity set with poor documentation can benefit from the semantics provided by interconnected SAP Core Data Services (CDS) views and business objects.

Knowledge graphs are ideal for representing structured and unstructured data in a single place. We can easily represent “Purchase Order” as a single object in a knowledge graph and define new relations on this unique object, simply adding one more link and leaving the rest untouched.

We can simply grow the knowledge graph with business data according to our needs, including or excluding sources depending on the use case for which it will be used.

3. RDF as the technical basis for knowledge graphs

Our technical basis for generating, storing and querying knowledge graphs is the graph data model RDF (Resource Description Framework), a set of standards developed by the W3C (World Wide Web Consortium). The below RDF source code contains the RDF data that is behind the visualization via “Figure 2: Example knowledge graph about business metadata and underlying data sources”.

kg-definition.png

The RDF code snippet uses the RDF Turtle syntax. From line 1 to line 12, prefixes are defined for classes of objects (e.g. ABAPTable) or properties (e.g. s4:relatedBusinessObject). RDF properties are used both for defining relations to other entities, like the s4:baseTable relation, and for referencing values, like cds:description.

The example also shows the role of RDF vocabularies, which are here identified via the prefixes (RDF also allows creating formal models for vocabularies via technologies like OWL or SHACL). Based on use case needs, we can create and extend our own, SAP-specific vocabularies and combine them with public vocabularies like SKOS. Vocabularies are a key means to realize the aforementioned knowledge graph capability of distinguishing “things from strings”. For example, we can distinguish two different entities which have the same name I_PURCHASECONTRACT as belonging to two different vocabularies: cdsview:I_PURCHASECONTRACT versus entitySet:I_PURCHASECONTRACT.

The actual RDF data in the example starts after the prefix definitions, in line 14. RDF graphs are built from so-called triples. The starting node of a relation is the subject (in line 14, abaptable:EKKO), the relation is the property or predicate (in line 14, s4:relatedBusinessObject), and the target of the relation is the object (in line 14, bo:PurchaseOrder).
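For readers who cannot see the screenshot, the triple structure described above can be sketched in Turtle. The prefix IRIs and the description literals below are illustrative assumptions, not SAP’s actual namespaces:

```turtle
@prefix s4:        <https://example.sap.com/s4/> .
@prefix bo:        <https://example.sap.com/businessobject/> .
@prefix cds:       <https://example.sap.com/cds/> .
@prefix cdsview:   <https://example.sap.com/cdsview/> .
@prefix abaptable: <https://example.sap.com/abaptable/> .

abaptable:EKKO
    s4:relatedBusinessObject bo:PurchaseOrder ;
    cds:description          "Purchasing Document Header" .

cdsview:I_PURCHASECONTRACT
    s4:baseTable    abaptable:EKKO ;
    cds:description "Purchase Contract" .
```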

Especially in the realm of knowledge graphs and generative AI, the graph data model of labelled property graphs (LPGs) has gained considerable attention and is supported by several graph tool vendors. The choice between the two graph data models is a topic of ongoing discussion; see a comparison of the two models. To this end, SAP is participating in the W3C RDF-star Working Group, which will foster interoperability between RDF and LPGs.

We chose RDF because capabilities like vocabularies are a key part of our knowledge graph approach: having knowledge graphs created and grown in a decentralized manner, driven by use cases. In this way, various groups can extend the knowledge graphs. “Various groups” means groups within SAP and, forward looking, also outside SAP: we want to enable our partners and customers to create their own knowledge graph content.

4. LLMs without grounding versus knowledge graphs

A lot of information about SAP’s knowledge assets is available on the Web and has been “seen” by LLMs as part of their training data. So, the question arises why one cannot use the LLMs directly to make use of SAP’s business metadata.

We explain the reason with another experiment. First, we ask an LLM the question “What is the base table of the CDS view I_PURCHASECONTRACT?”. We then compare the results with a query to the knowledge graph. The result of both approaches is shown in “Figure 3: Hallucinations of LLMs compared to verified information provided by the business metadata knowledge graph”.

fsasaki_0-1723031521917.png

Figure 3: Hallucinations of LLMs compared to verified information provided by the business metadata knowledge graph

The response from the LLM on the left side of the figure provides some reasonable guesses, but the LLM also states that it cannot provide a precise answer. Moreover, even for these guesses, we as users cannot verify the correctness of the results.

The quality of responses from LLMs can be improved with grounding, and knowledge graphs turn out to be a particularly effective grounding technique in such cases.

The query to the knowledge graph provides the exact response. The CDS view has a relation which makes the base table explicit. Since we generated the graph from trusted SAP metadata sources, the response is factual, precise, and reliable.

5. Knowledge graphs and retrieval augmented generation

The aforementioned limitation of LLMs not knowing about SAP’s knowledge assets can be addressed by retrieval augmented generation (RAG) combined with knowledge graphs (coined GraphRAG). We will briefly explain what the interplay between RAG and knowledge graphs can look like, using the example query “What is the base table of the CDS view I_PURCHASECONTRACT?”.

In RAG, data is stored as vector embeddings in a vector database; retrieval and response generation involve LLMs. Embeddings are generated per chunk of data. A user question in natural language is then matched against the vector database, and chunks whose vector representation matches the question are used for generating a response to the question.
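The retrieval step can be illustrated with a toy example. The sketch below uses made-up 3-dimensional embeddings and cosine similarity; real systems use high-dimensional embeddings from an embedding model, so the chunk contents and vectors here are purely illustrative.

```python
import math

def cosine(a, b):
    # Cosine similarity between two vectors.
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

# Toy "vector database": chunk identifier -> (embedding, text).
chunks = {
    "cdsview:I_PURCHASECONTRACT": ([0.9, 0.1, 0.2], "Purchase contract CDS view ..."),
    "cdsview:I_SALESORDER":       ([0.1, 0.8, 0.3], "Sales order CDS view ..."),
}

def retrieve(query_embedding, k=1):
    # Rank chunks by similarity to the question's embedding, return top-k ids.
    ranked = sorted(chunks.items(),
                    key=lambda kv: cosine(query_embedding, kv[1][0]),
                    reverse=True)
    return [chunk_id for chunk_id, _ in ranked[:k]]

# A question about purchase contracts embeds close to the first chunk.
print(retrieve([0.85, 0.15, 0.25]))  # ['cdsview:I_PURCHASECONTRACT']
```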

For the use case of finding the base table, we could store the information about base tables of CDS views in a document and index it in the vector database. This would then allow retrieval of the base table via the LLM alone.

The issue with this approach is that in SAP’s knowledge assets, there are many relations between entities like CDS views, ABAP tables and business objects. So, we cannot encode all potentially use case relevant relations in a chunk for a given entity. For example, a CDS view is not always directly connected to its base table. If we want to capture the base table relation, we will need to re-generate the index. And this would have to be done for any other distant relations that are relevant for the use case.

The approach of re-generating vector indices is time-consuming and computationally expensive. It is also not as precise as a query to the knowledge graph shown in “Figure 2: Example knowledge graph about business metadata and underlying data sources”: even if the base table relation is stored in a chunk, vector-based retrieval may provide different responses for the same user query.

Despite these limitations, there are huge benefits to using vector databases in generative AI. Natural language access is a key feature that knowledge graphs lack: one can ask questions to an LLM, or via an LLM to a vector database, and get a response. That is not possible with knowledge graphs alone.

The way forward here is to combine the best of both worlds via GraphRAG: one can use LLMs as a way to access the verified information in the knowledge graph. “Figure 4: Combining the best of two worlds: access to knowledge graphs via LLMs” shows how this works in general, without going into implementation details.

fsasaki_1-1723031651460.png

Figure 4: Combining the best of two worlds: access to knowledge graphs via LLMs

As a design-time step of the overall workflow, we extract well-documented nodes of the knowledge graph. This extraction is realized via the RDF query language SPARQL. The SPARQL query below selects all CDS views and groups their descriptions together with the descriptions of related business objects.

sparql.png
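While the screenshot shows the actual query, a simplified sketch of such a selection could look as follows; the class, property, and prefix names are assumed to match the earlier examples and are illustrative:

```sparql
PREFIX cds: <https://example.sap.com/cds/>
PREFIX s4:  <https://example.sap.com/s4/>

# Collect each CDS view together with its own description and the
# descriptions of its related business objects.
SELECT ?cdsView
       (GROUP_CONCAT(DISTINCT ?description; separator="; ") AS ?descriptions)
WHERE {
  ?cdsView a cds:CDSView .
  { ?cdsView cds:description ?description . }
  UNION
  { ?cdsView s4:relatedBusinessObject/cds:description ?description . }
}
GROUP BY ?cdsView
```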

An output of the query for the CDS view cds:I_PURCHASECONTRACT is below.

sparql-result.png

This output is imported into the vector database. Each chunk is then uniquely identified via the CDS view ID. The embeddings of the chunk are generated based on the extracted textual information.

After this design-time preparation step, the system can be used at runtime, as explained below.

A user asks a question about the base table. The LLM matches the entity I_PURCHASECONTRACT from the question against the vector database. The vector similarity processing leads to the chunk with the unique identifier for cdsview:I_PURCHASECONTRACT from the knowledge graph.

With this identifier and information about RDF properties like s4:baseTable, a knowledge graph query is generated. In our current scenarios, and to ensure robust SPARQL generation, this step is based on a pre-defined query template that is filled with the CDS view ID. The outcome of the query is the table name EKKO.

sparql-template.png
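A template of this kind might look as follows, where the placeholder is filled with the CDS view identifier retrieved from the vector database; the names and prefixes are illustrative, matching the earlier sketches rather than the actual implementation:

```sparql
PREFIX cdsview: <https://example.sap.com/cdsview/>
PREFIX s4:      <https://example.sap.com/s4/>

# {cds_view_id} is replaced with the retrieved identifier,
# e.g. I_PURCHASECONTRACT, before the query is executed.
SELECT ?baseTable
WHERE {
  cdsview:{cds_view_id} s4:baseTable ?baseTable .
}
```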

The response EKKO is then provided to the user.
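The runtime flow can be condensed into a toy end-to-end sketch. Everything below is a stand-in for illustration: the keyword match replaces real embedding similarity, the dictionary replaces the triple store, and the function names are assumptions, not SAP’s implementation.

```python
# Toy end-to-end sketch of the GraphRAG runtime flow described above.

def match_chunk(question):
    # Stand-in for the vector similarity match: in the real system the
    # question is embedded and matched against the vector database.
    if "I_PURCHASECONTRACT" in question:
        return "cdsview:I_PURCHASECONTRACT"
    return None

# A pre-defined SPARQL template, filled with the retrieved CDS view ID
# ({{ }} escapes literal braces for str.format).
TEMPLATE = "SELECT ?t WHERE {{ {cds_view} s4:baseTable ?t }}"

# Stand-in for the knowledge graph: (subject, predicate) -> object.
TRIPLES = {("cdsview:I_PURCHASECONTRACT", "s4:baseTable"): "abaptable:EKKO"}

def answer(question):
    chunk_id = match_chunk(question)
    if chunk_id is None:
        return None
    query = TEMPLATE.format(cds_view=chunk_id)  # robust, template-based SPARQL
    # A real system would send `query` to a SPARQL endpoint; we look up directly.
    return TRIPLES.get((chunk_id, "s4:baseTable"))

print(answer("What is the base table of the CDS view I_PURCHASECONTRACT?"))
# abaptable:EKKO
```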

The benefit of combining LLMs and knowledge graphs can be summarized as follows.

  • We enable natural language exploration of SAP’s unique knowledge assets
  • We ensure up-to-date and relevant responses to queries related to SAP data models
  • We avoid hallucinations and irrelevant results
  • We ensure domain-specific results, while exploiting the strength of LLM capabilities with regards e.g. to ranking results
  • We make data provenance explicit by supporting data lineage and traceability.

At SAP, we believe in the power of representing data in its unique context and in relation to other objects. Only in this way can we make the implicit explicit, get the most out of the rich enterprise data we have collected over the years, and provide value to our customers.

6. Evaluation of LLMs compared to knowledge graphs

We evaluated our implementation to prove the value of the knowledge graph compared to the LLM only approach described in section “LLMs without grounding versus knowledge graphs”.

We used a small evaluation set of business questions together with the CDS views or ABAP tables that are suitable responses to those questions. The questions were created by an SAP S/4HANA domain expert and by data scientists, who were the primary user group in our scenario. The following table summarizes the outcome of the experiments.

 

                           LLM w/o KG     LLM powered by KG
CDS view recall            0.17           0.40 in top 5 results
CDS view hallucination     26/37 (70%)    0
ABAP table recall          0.3            0.72 in top 10 results
ABAP table hallucination   6/83 (7%)      0

Our experiments were conducted with 22 business questions against around 5,400 public CDS views and around 3,000 ABAP base tables. The overall graph contains 80,000 CDS views and 450,000 ABAP tables. The LLM should know the public CDS views from Web documentation. Still, the LLM performs much worse than the approach relying on the knowledge graph. And since the knowledge graph results are always based on queries to the graph, there are no hallucinations at all.

The ABAP tables have less rich documentation compared to public CDS views. But after selecting CDS view IDs via the vector similarity search, we can retrieve the ABAP tables using the previously described SPARQL base table query. This explains our good performance for ABAP tables: more than 70% of the correct tables are detected. For the LLM-only approach, only 30% of the results are correct, and again the LLM produces hallucinations.
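The two metrics in the table above, recall and hallucination rate, can be made concrete with a small sketch. The evaluation data below is made up for illustration; only the metric definitions reflect the evaluation described in this section.

```python
# Toy sketch of the two evaluation metrics used above.

def recall_at_k(expected, predicted, k):
    """Fraction of questions whose expected item appears in the top-k predictions."""
    hits = sum(1 for exp, preds in zip(expected, predicted) if exp in preds[:k])
    return hits / len(expected)

def hallucination_rate(predicted_names, known_names):
    """Fraction of predicted names that do not exist in the catalog at all."""
    flat = [name for preds in predicted_names for name in preds]
    return sum(1 for name in flat if name not in known_names) / len(flat)

# Made-up evaluation data: two questions, their expected CDS views,
# and the predictions of some system; I_MADEUPVIEW does not exist.
expected  = ["I_PURCHASECONTRACT", "I_SALESORDER"]
predicted = [["I_PURCHASECONTRACT", "I_PURCHASEORDER"], ["I_MADEUPVIEW"]]
known     = {"I_PURCHASECONTRACT", "I_SALESORDER", "I_PURCHASEORDER"}

print(recall_at_k(expected, predicted, k=5))   # 0.5
print(hallucination_rate(predicted, known))    # ≈ 0.33
```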

7. Outlook on knowledge graphs and large language models

In this blog post, we have described how knowledge graphs benefit generative AI. Focusing on SAP S/4HANA, with examples from our current knowledge graphs, we explained how knowledge graphs address the risk of hallucination and enable the development of generative AI applications with a key differentiator: SAP’s unique knowledge assets, encoded in a semantically explicit and trustworthy manner.

We are currently working on several use cases for the knowledge graphs. Use case areas include: 1) the support of data model discovery and configuration during the design time as well as the run time of AI applications; 2) making data lineage and data provenance explicit for AI applications, as a key contribution to trustworthy and explainable AI; and 3) providing semantics to SAP’s foundation model on structured data, which we mentioned at the beginning of this blog post. In addition, we are collaborating with several LoBs on how to enrich our current knowledge graphs with vast amounts of unstructured data in the SAP ecosystem, and we are looking into enabling SAP consultants, partners and customers to extend the knowledge graphs with their own content.

In terms of knowledge graph capabilities, we aim to realize the full life cycle of knowledge graphs in the context of generative AI; see “Figure 5: How knowledge graphs and LLMs can benefit from each other”.

fsasaki_0-1723034233533.png

Figure 5: How knowledge graphs and LLMs can benefit from each other

In this cycle, there are some steps that are specific to knowledge graphs, like analyzing source data, modeling the knowledge graph, or providing access to the graph via LLMs, as discussed in section “Knowledge graphs and retrieval augmented generation”. But there is much more to explore on how knowledge graphs and LLMs can benefit each other, e.g. fine-tuning LLMs via knowledge graphs, using LLMs for generating knowledge graphs out of structured data, or supporting knowledge graph modeling via LLMs.

We are looking forward to continuing this exciting journey and to advancing the state of the art in generative AI for SAP, our customers, and the AI community at large.

Co-authored by Felix Sasaki, Isil Pekel, Pavithra G.K. and Johannes Hoffart

The following people (written in alphabetic order) contributed substantially to developing the implementation in this post that combines large language models and knowledge graphs: Isil Pekel, Felix Sasaki, Sebastian Schreiber, Paul Skiba, Darko Velkoski and Elham Zamansan.

Knowledge graphs in business AI are driven forward by the Business AI Knowledge Graphs team, led by Pavithra G.K. and including the following core contributors: Christoph Meyer, Tanguy Lucci, Natalia Minakova, Nikhil Patra, Isil Pekel, Manuel Zeise, Xiang Yu and Elham Zamansani.

The work on knowledge graphs and AI in SAP continues with further use cases and a growing number of people involved – further updates will follow!
