Technology Blogs by SAP
AkashAmarendra
Product and Topic Expert

Background 

With the growing popularity of generative AI, enterprises are increasingly adopting the technology to gain a competitive edge. The excitement is tempered, however, by a pressing concern: data protection. Customers face a dilemma. On one hand, they need to feed critical business data to a Large Language Model (LLM) so that it can analyze the data in its business context; on the other hand, they must comply strictly with data protection laws to protect individual privacy and ensure responsible data management.

This blog shows how to address that dilemma in SAP Cloud Application Programming Model (CAP) applications that use generative AI, using the CAP LLM Plugin together with SAP HANA Cloud data anonymization. The following solution diagram depicts the technical architecture of our sample application.

[Figure: solution diagram of the sample application]

Why Data Anonymization? 

As mentioned earlier, enterprises need to strike a balance between data utility and privacy protection when sending their datasets to LLMs for business use cases. Data anonymization helps to achieve this goal.

Before we go into more detail about data anonymization, we should look at the difference between data anonymization and data masking.

Data masking conceals sensitive, classified, or personal information within a dataset by replacing it with random characters, dummy data, or fake information. This process creates an altered version of the data which leads to loss of authenticity and business context.  

Data anonymization addresses these challenges with a more structured approach to privacy protection. It removes or alters personally identifiable information (PII) in the dataset so that individuals can no longer be identified, while the business context and authenticity of the data are preserved.
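To make the contrast concrete, here is a minimal JavaScript sketch. The helpers maskValue and generalizeAge and the sample record are hypothetical and are not part of any SAP API; generalization into ten-year bands is just one simple way of altering a value while keeping it meaningful.

```javascript
// Illustrative only: hypothetical helpers, not part of any SAP API.

// Masking: replace every character with a filler, destroying the value.
function maskValue(value) {
  return "*".repeat(String(value).length);
}

// Generalization (one common anonymization technique): replace an exact
// age with the ten-year band that contains it, so it stays useful.
function generalizeAge(age) {
  const lower = Math.floor(age / 10) * 10;
  return `${lower}-${lower + 9}`;
}

// Made-up sample record.
const employee = { region: "EMEA", age: 35 };

const masked = {
  region: maskValue(employee.region),
  age: maskValue(employee.age),
};
const generalized = {
  region: employee.region,
  age: generalizeAge(employee.age),
};

console.log(masked.region, masked.age);           // **** **
console.log(generalized.region, generalized.age); // EMEA 30-39
```

The masked record no longer tells an LLM anything about the employee, whereas the generalized record still carries a region and an age band.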

With that difference in mind, we can look at a detailed example of why data anonymization is needed.

First, we should understand that a dataset usually has these types of data:  

  • Identifiers: Attributes that clearly identify individuals in a dataset, for example names, account numbers, or email addresses.
  • Quasi-identifiers: Attributes that do not directly identify individuals but that may allow someone to deduce a person’s identity based on their unique combination such as age, ZIP code, or education.  
  • Sensitive information: Attributes that are highly sensitive, for example people’s health status, salaries, Social Security Number (SSN) or other confidential information. 

Second, imagine we have the following HR dataset:  

[Figure: sample HR dataset]

The above HR dataset contains the region, tlevel (job grade), gender, age, and salary of a company’s employees. None of these attributes identifies an individual employee on its own, because all identifiers (such as names, IDs, and emails) have been deleted or hidden. That does not, however, prevent people from being identified through their quasi-identifiers. A data analyst at the company might find that only one person in the dataset is 35 years old, female, and living in a certain region, and realize that this is one of their colleagues. The colleague’s sensitive salary information is now exposed to the analyst. Something similar can happen when organizations send their datasets to LLMs, because the data might be used for purposes such as training the model and later be exposed.
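The re-identification risk described here can be checked mechanically. The following JavaScript sketch counts how many rows share each combination of quasi-identifiers and reports the rows that fall into a group smaller than k. The sample rows and the helper findExposedRows are hypothetical and only illustrate the idea; they are not taken from the dataset shown in the figure.

```javascript
// Hypothetical sample rows; the values are made up for illustration.
const rows = [
  { region: "EMEA", gender: "Female", age: 35, salary: 90000 },
  { region: "EMEA", gender: "Male", age: 41, salary: 85000 },
  { region: "EMEA", gender: "Male", age: 41, salary: 70000 },
  { region: "APJ", gender: "Female", age: 29, salary: 65000 },
  { region: "APJ", gender: "Female", age: 29, salary: 72000 },
];

// Return the rows whose quasi-identifier combination is shared by
// fewer than k rows, i.e. the rows that are easy to single out.
function findExposedRows(data, quasiIdentifiers, k) {
  const keyOf = (row) => JSON.stringify(quasiIdentifiers.map((q) => row[q]));
  const counts = new Map();
  for (const row of data) {
    const key = keyOf(row);
    counts.set(key, (counts.get(key) || 0) + 1);
  }
  return data.filter((row) => counts.get(keyOf(row)) < k);
}

// With k = 2, only the single 35-year-old female in EMEA is exposed.
const exposed = findExposedRows(rows, ["region", "gender", "age"], 2);
console.log(exposed.length); // 1
```

A dataset in which this function returns no rows for a given k is what the K-Anonymity algorithm discussed later aims to produce.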

Why the CAP LLM Plugin

The CAP LLM Plugin offers a seamless way to anonymize sensitive data within SAP CAP applications by harnessing the anonymization capabilities of SAP HANA Cloud, while still preserving the business context of the data.

Features of the CAP LLM Plugin: 

  • Easily installable. 
  • Supports all the capabilities of SAP HANA Cloud’s data anonymization. 
  • Provides a simple 2-step process to anonymize data in a CAP application.  
  • (Roadmap item) Integration with SAP HANA Cloud’s Vector Store/Engine to effortlessly store vector embeddings in SAP HANA Cloud and seamlessly perform similarity search. 
  • (Roadmap item) Integration with Joule, the natural-language generative AI copilot, to leverage the full potential of conversational AI with SAP systems.

How to use the CAP LLM Plugin in CAP applications 

Let’s explore how we can seamlessly use the CAP LLM Plugin to achieve data anonymization in an example CAP application leveraging SAP HANA Cloud’s data anonymization techniques. 

Consider a CAP application where personalized emails need to be generated to congratulate employees on their contributions to the company. This process involves feeding sensitive employee information (as shown below) to the LLM to generate personalized employee emails. However, we want to ensure that the confidentiality of this sensitive information is preserved while retaining sufficient business context. 

[Figure: sensitive employee data]

In this scenario, the CAP LLM Plugin can be used to anonymize the sensitive employee data by applying SAP HANA Cloud data anonymization techniques. The following table shows what the anonymized employee data might look like:

[Figure: anonymized employee data]

If we look at the anonymized employee data, we see that the quasi-identifiers (columns that could lead to identifying an individual employee) such as ‘age’, ‘tlevel’ (job grade), ‘gender’ and ‘region’ have been modified so that an individual employee can no longer be identified from the dataset, while essential employee information is preserved for the LLM to generate a meaningful response.

Now let’s have a look at the steps to apply the CAP LLM Plugin for data anonymization in this example scenario.

Note on CAP application structure: 

  • In a CAP Application, there are usually 3 folders - app, db, and srv. 
  • db is where you define the entity. 
  • srv is where you expose it and use the entity. 
  • Detailed information on how to build a CAP application can be found here. 

Pre-requisites: 

  • Install the CAP LLM plugin in your CAP project using the following npm command:

npm install cap-llm-plugin

You will need SAP HANA Cloud attached to your CAP application already. Refer here to deploy your CAP application on SAP BTP, Cloud Foundry Runtime. Refer here to enable Cloud Foundry environment in your SAP BTP subaccount. 

  • Set up default permissions for the HDI container in your CAP application. In the db section of the CAP application, perform the steps specified in the "how" section of the blog.
  • Add the cds "cap-llm-plugin" service to the cds requires section of package.json in the CAP application as follows:

"cds": {
  "requires": {
    "cap-llm-plugin": true
  }
}

Steps to apply CAP LLM Plugin in CAP application: 

You can anonymize entities in a CAP application using the CAP LLM plugin in a simple two-step process as follows: 

Step 1: Define an anonymized entity:

You can anonymize an entity in a CAP application with a single ‘@anonymize’ annotation. The annotation can be applied at the entity level and at the column level, and it is placed on the entity definitions in the db folder.

Annotating at the entity level:  

You will need to annotate the entity you want to anonymize with the ‘@anonymize’ annotation and provide the anonymization algorithm and parameters needed to anonymize the entity. The plugin will then apply the algorithm and parameters to anonymize the entity. 

Annotating at the column level:  

You will need to identify and annotate one column to act as the sequence column and specify how each of the other columns is to be anonymized. Not all columns need to be anonymized.

For more information on the data anonymization algorithms and parameters, refer to the SAP HANA Cloud Data Anonymization Documentation. 

Now, let's explore how to apply annotations to anonymize an entity in a CAP application. In the example below, consider the employee entity containing sensitive employee information. 

Firstly, we specify the data anonymization algorithm for the entity by passing `ALGORITHM 'K-ANONYMITY'` and the necessary parameters to the ‘@anonymize’ annotation. Here the employee entity is anonymized using the K-Anonymity algorithm, which conceals each individual employee within a group of similar employees; the parameter `"k" : 3` sets the minimum group size to 3 employees.

Secondly, we specify how the columns of the entity are to be anonymized. The ‘id’ column is designated as the sequence column by passing `{"is_sequence": true}` to the ‘@anonymize’ annotation. Similarly, the columns ‘region’, ‘gender’ and ‘age’ are marked as quasi-identifiers (columns that could lead to identifying an individual employee) by setting `{"is_quasi_identifier": true}` and passing the hierarchy details.

In the sample entity, we specify that the ‘region’ column has the values [["APJ"],["EMEA"],["NA"]], which should be considered for anonymization. For the ‘age’ column, we specify that the actual age values are to be anonymized to a value close to the actual one. For instance, in the sample entity, the age value of 42 will be anonymized to 45, which is still close to the actual value rather than completely randomized.

using { Currency, managed, sap } from '@sap/cds/common';
namespace sap.cap;

// Entity level: algorithm and parameters used to anonymize Employee
@anonymize : `ALGORITHM 'K-ANONYMITY' PARAMETERS '{"k" : 3}'`
entity Employee {
  // Sequence column
  key id : Integer @anonymize : `{"is_sequence": true}`;
  name : String;
  // Quasi-identifiers, each with its embedded hierarchy
  region : String @anonymize : `{"is_quasi_identifier" : true, "hierarchy": {"embedded" : [["APJ"],["EMEA"],["NA"]]}}`;
  tlevel : String;
  gender : String @anonymize : `{"is_quasi_identifier" : true, "hierarchy": {"embedded" : [["Female"],["Male"]]}}`;
  age : String @anonymize : `{"is_quasi_identifier" : true, "hierarchy": {"embedded" : [["27", "15"], ["42", "45"], ["50", "45"], ["12", "15"]]}}`;
  personalizedEmail : String;
}

With these settings, the plugin applies the ‘K-Anonymity’ algorithm and ensures that no single quasi-identifier, nor any combination of them, can be used to identify an individual employee.
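To make the embedded hierarchy of the ‘age’ column more tangible, the following JavaScript sketch reads each inner array as a pair of original value and generalized value and looks up the replacement for a given age. The helper generalize is hypothetical and is not how the plugin or SAP HANA Cloud is implemented; in particular, the decision of when a value actually needs to be generalized to reach the group size k is made by SAP HANA Cloud and is not modelled here.

```javascript
// Embedded hierarchy copied from the 'age' column annotation:
// each inner array is read here as [original value, generalized value].
const ageHierarchy = [["27", "15"], ["42", "45"], ["50", "45"], ["12", "15"]];

// Hypothetical helper: look up the first-level generalization of a value.
// Values without a hierarchy entry are returned unchanged in this sketch.
function generalize(hierarchy, value) {
  const entry = hierarchy.find(([original]) => original === value);
  return entry && entry.length > 1 ? entry[1] : value;
}

console.log(generalize(ageHierarchy, "42")); // 45
console.log(generalize(ageHierarchy, "50")); // 45
```

Because both 42 and 50 map to 45, the two employees become indistinguishable by age once the value is generalized, which is the effect the hierarchy is meant to achieve.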

Now that you have annotated the entity, the plugin takes care of dynamically anonymizing the entity using the inputs provided to the ‘@anonymize’ annotation. 

You can optionally also pass in custom hierarchy functions with custom logic to anonymize entity columns.  

Step 2: Consume the anonymized data of the entity: 

Now that the entity has been anonymized, you can consume the anonymized data of the entity using a single method.  

In the sample below, let’s explore how we can consume the anonymized employee data from the employee entity.

Firstly, we need to define the service that exposes the anonymized employee entity, for example, in the srv/employee-service.cds file, as follows:

using { sap.cap as cap } from '../db/schema';
service EmployeeService @(path:'/browse') {
  entity Employee as projection on cap.Employee;
}

Next, in the business logic (for instance, in the srv/employee-service.js file), use the getAnonymizedData method of the cds “cap-llm-plugin” service to retrieve the anonymized data of the 'Employee' entity exposed in 'EmployeeService'. The method takes the name of the service entity as its first argument.

Optionally, to retrieve specific records of the anonymized employee entity, you can pass the sequence id(s) as a second argument. For example, the following call retrieves the record with id=1001:

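// Connect to the plugin service registered under cds.requires
const anonymizer = await cds.connect.to("cap-llm-plugin");
// Retrieve the anonymized Employee record whose sequence id is 1001
const response = await anonymizer.getAnonymizedData("EmployeeService.Employee", [1001]);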

Now that you have the anonymized employee data, which retains the crucial business context, you can feed it to LLMs and obtain a meaningful response while preserving the confidentiality of the data.
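As a rough illustration of this last step, the JavaScript sketch below builds a prompt for the personalized email scenario from anonymized fields only. The record shape, the sample values and the helper buildEmailPrompt are assumptions made for this example; the actual response of getAnonymizedData may use different property names or casing, and the call to the LLM itself is left out.

```javascript
// Hypothetical shape of one anonymized record; the real response of
// getAnonymizedData may use different property names or casing.
const anonymizedEmployee = { id: 1001, region: "EMEA", tlevel: "T3", age: "45" };

// Build an LLM prompt from anonymized fields only. No name or other
// direct identifier is included in the text sent to the model.
function buildEmailPrompt(employee) {
  return [
    "Write a short email congratulating an employee on their contributions.",
    `Region: ${employee.region}`,
    `Job grade: ${employee.tlevel}`,
    `Age group: ${employee.age}`,
    "Do not guess or mention any personal details beyond these.",
  ].join("\n");
}

const prompt = buildEmailPrompt(anonymizedEmployee);
console.log(prompt);
```

Keeping the prompt construction in one small function makes it easy to review exactly which fields leave the application.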

Detailed documentation on the use of the CAP LLM plugin with samples can be found here. 

Thanks a lot for reading this blog to the end. Hopefully you now have a good understanding of data anonymization and how to apply it with the CAP LLM Plugin within a CAP application. Please stay tuned, since we have new features coming soon, such as integration with SAP HANA Cloud’s Vector Engine and Joule (the natural-language generative AI copilot).

Additional Reading:

Please have a look at the discovery center mission regarding Retrieval Augmented Generation with GenAI on SAP BTP.

Credits:

Many thanks to our colleagues for their support and collaboration in validating the plugin - Alper Dedeoglu, David Kunz, Steffen Weinstock, Kaweh Amoi-Taleghani. Thanks to our team members for their contributions – Liang Feng, Alex Bishka, Karishma Kapur, Sangeetha Krishnamoorthy, Weikun Liu. Special thanks to Sivakumar N and Anirban Majumdar for support and guidance.

If you have any questions, please contact us at paa@sap.com