With the growing popularity of generative AI, enterprises are increasingly embracing the technology to gain a competitive edge. However, the excitement surrounding its adoption is tempered by a pressing concern: data protection. Customers face a critical dilemma. On one hand, they need to feed critical business data to a Large Language Model (LLM) so that it can produce better analysis grounded in business context; on the other hand, they must comply strictly with data protection laws to protect individual privacy and ensure responsible data management.
This blog shows how to address this dilemma in SAP Cloud Application Programming Model (CAP) applications that use generative AI, using the CAP LLM Plugin together with SAP HANA data anonymization. The following solution diagram depicts the technical side of our sample application.
As mentioned earlier, enterprises need to strike a balance between data utility and privacy protection when sending their dataset to LLMs for business use cases. Data anonymization helps to achieve this goal.
But before we go into more detail about data anonymization, we should look at the difference between data anonymization and data masking.
Data masking conceals sensitive, classified, or personal information within a dataset by replacing it with random characters, dummy data, or fake information. This process creates an altered version of the data, which leads to a loss of authenticity and business context.
Data anonymization addresses these challenges with a more structured approach to modifying data for privacy protection: it removes or alters personally identifiable information (PII) in the dataset so that an individual can no longer be identified, while the business context and authenticity of the data are retained.
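To make the contrast concrete, here is a minimal JavaScript sketch. The record, the field names, and the 10-year age range are illustrative assumptions, not part of the plugin or of SAP HANA; generalization is shown as just one possible anonymization technique.

```javascript
// Hypothetical employee record; field names and values are illustrative only.
const employee = { name: "Jane Doe", age: 42, region: "EMEA", salary: 85000 };

// Masking: overwrite a value with dummy characters. The original meaning is lost.
function maskValue(value) {
  return "*".repeat(String(value).length);
}

// Generalization (one anonymization technique): replace an exact age with a
// 10-year range, so the value is less identifying but still usable for analysis.
function generalizeAge(age) {
  const lower = Math.floor(age / 10) * 10;
  return `${lower}-${lower + 9}`;
}

// Masked copy: name and age become meaningless placeholders.
const masked = {
  ...employee,
  name: maskValue(employee.name),
  age: maskValue(employee.age),
};

// Anonymized copy: the identifier (name) is dropped, the age is generalized.
const anonymized = {
  region: employee.region,
  age: generalizeAge(employee.age),
  salary: employee.salary,
};

console.log(masked);
console.log(anonymized);
```

The masked age no longer tells an analyst (or an LLM) anything, whereas the generalized age still supports statements such as "employees in their forties".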
Now that we know the difference between anonymization and masking, we can turn to a detailed example of why data anonymization is needed.
First, we should understand that a dataset usually has these types of data:
Second, imagine we have the following HR dataset:
The HR dataset above contains the region, tlevel (job grade), gender, age, and salary of a company's employees. None of these columns can identify an individual employee on its own, because all identifiers (such as names, IDs, and emails) have been deleted or hidden. However, this does not prevent people from being identified through their quasi-identifiers. A data analyst at the company might find that only one person in the dataset is 35 years old, female, and living in a certain region, and realize that this is one of their colleagues. The colleague’s sensitive salary information is now exposed to the analyst. Something similar can happen when organizations send their datasets to LLMs, because the data might be used for purposes such as training the model and later be exposed.
CAP LLM Plugin offers a seamless way to anonymize sensitive data within SAP CAP applications by harnessing the anonymization capabilities of SAP HANA Cloud while still preserving the business context of the data.
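The re-identification risk described above can be sketched in a few lines of JavaScript. All rows and values below are made up for illustration; only the column names follow the HR dataset described in the text.

```javascript
// Hypothetical HR rows with direct identifiers (names, ids, emails) already removed.
const hrRows = [
  { region: "APJ", tlevel: "T2", gender: "Female", age: 35, salary: 72000 },
  { region: "APJ", tlevel: "T2", gender: "Male", age: 35, salary: 70000 },
  { region: "EMEA", tlevel: "T3", gender: "Female", age: 41, salary: 91000 },
  { region: "APJ", tlevel: "T1", gender: "Female", age: 29, salary: 58000 },
];

// What the analyst happens to know about a colleague (quasi-identifiers only).
const known = { region: "APJ", gender: "Female", age: 35 };

// Keep only the rows that agree with everything the analyst knows.
const matches = hrRows.filter((row) =>
  Object.entries(known).every(([column, value]) => row[column] === value)
);

// A single match means the "anonymous" row can be linked to the colleague.
if (matches.length === 1) {
  console.log("Unique match, salary exposed:", matches[0].salary);
}
```

Even though no row carries a name, the combination of region, gender, and age narrows the dataset down to one record, which is exactly what anonymization of quasi-identifiers is meant to prevent.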
Features of the CAP LLM Plugin:
Let’s explore how we can seamlessly use the CAP LLM Plugin to achieve data anonymization in an example CAP application leveraging SAP HANA Cloud’s data anonymization techniques.
Consider a CAP application where personalized emails need to be generated to congratulate employees on their contributions to the company. This process involves feeding sensitive employee information (as shown below) to the LLM to generate personalized employee emails. However, we want to ensure that the confidentiality of this sensitive information is preserved while retaining sufficient business context.
In this scenario, the CAP LLM Plugin can be used to anonymize the sensitive employee data by applying SAP HANA Cloud data anonymization techniques. The following table shows what the anonymized employee data might look like:
Looking at the anonymized employee data, we see that the quasi-identifiers (columns that could lead to identifying an individual employee) such as ‘age’, ‘tlevel’ (job grade), ‘gender’, and ‘region’ have been modified so that an individual employee can no longer be identified from the dataset, while essential employee information is preserved for the LLM to generate a meaningful response.
Now let’s have a look at the steps to apply the CAP LLM Plugin for data anonymization in this example scenario.
Note on CAP application structure:
npm install cap-llm-plugin
You will need SAP HANA Cloud attached to your CAP application already. Refer here to deploy your CAP application on SAP BTP, Cloud Foundry Runtime. Refer here to enable Cloud Foundry environment in your SAP BTP subaccount.
"cds": {
"requires": {
"cap-llm-plugin": true
}
}
You can anonymize entities in a CAP application using the CAP LLM plugin in a simple two-step process as follows:
Step 1: Annotate the entity to be anonymized:
You can anonymize the entity in a CAP application with a single ‘@anonymize’ annotation. The annotations can be applied at entity and entity column level. Specifically, this is applied on the entity definitions in the db folder.
Annotating at the entity level:
You will need to annotate the entity you want to anonymize with the ‘@anonymize’ annotation and provide the anonymization algorithm and parameters needed to anonymize the entity. The plugin will then apply the algorithm and parameters to anonymize the entity.
Annotating at the column level:
You will need to identify and annotate the column to act as a sequence column and specify how to anonymize each column. Not all columns need to be anonymized.
For more information on the data anonymization algorithms and parameters, refer to the SAP HANA Cloud Data Anonymization Documentation.
Now, let's explore how to apply annotations to anonymize an entity in a CAP application. In the example below, consider the employee entity containing sensitive employee information.
Firstly, we specify the data anonymization algorithm for the entity by passing `ALGORITHM 'K-ANONYMITY'` and the necessary parameters to the ‘@anonymize’ annotation. Here the employee entity will be anonymized using the ‘K-Anonymity’ algorithm, which conceals each individual employee within a group of similar employees; the parameter `"k" : 3` sets the minimum group size to 3 employees.
Secondly, we specify how the columns of the entity need to be anonymized. The ‘id’ column is designated as a sequence column by passing `{"is_sequence": true}` to the ‘@anonymize’ annotation. Similarly, the columns ‘region’, ‘gender’ and ‘age’ are marked as quasi-identifiers (columns that could lead to identifying an individual employee) by setting `"is_quasi_identifier": true` and passing the hierarchy details.
In the sample entity, we specify that the ‘region’ column has the values [["APJ"],["EMEA"],["NA"]], which should be considered for anonymization. For the ‘age’ column, we specify that the actual age values need to be anonymized to a value close to the actual value. For instance, in the sample entity, the age value 42 will be anonymized to 45, which is still close to the actual value rather than completely randomized.
using { Currency, managed, sap } from '@sap/cds/common';
namespace sap.cap;
@anonymize : `ALGORITHM 'K-ANONYMITY' PARAMETERS '{"k" : 3}'`
entity Employee {
  key id            : Integer @anonymize : `{"is_sequence": true}`;
  name              : String;
  region            : String @anonymize : `{"is_quasi_identifier" : true, "hierarchy": {"embedded" : [["APJ"],["EMEA"],["NA"]]}}`;
  tlevel            : String;
  gender            : String @anonymize : `{"is_quasi_identifier" : true, "hierarchy": {"embedded" : [["Female"],["Male"]]}}`;
  age               : String @anonymize : `{"is_quasi_identifier" : true, "hierarchy": {"embedded" : [["27", "15"], ["42", "45"], ["50", "45"], ["12", "15"]]}}`;
  personalizedEmail : String;
}
With these settings, the plugin applies the ‘K-Anonymity’ algorithm and ensures that neither a single quasi-identifier nor a combination of them can be used to identify an individual employee.
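The guarantee can be stated as a simple property: every combination of quasi-identifier values must occur in at least k rows. The JavaScript sketch below only checks that property on already anonymized rows; it is not how SAP HANA Cloud computes the anonymized view, and the rows, the `isKAnonymous` helper, and the use of "*" for a fully generalized value are assumptions made for illustration.

```javascript
// Returns true when every combination of quasi-identifier values
// occurs in at least k rows (the k-anonymity property).
function isKAnonymous(rows, quasiIdentifiers, k) {
  const groupSizes = new Map();
  for (const row of rows) {
    // Rows with identical quasi-identifier values share one group key.
    const key = JSON.stringify(quasiIdentifiers.map((column) => row[column]));
    groupSizes.set(key, (groupSizes.get(key) || 0) + 1);
  }
  return [...groupSizes.values()].every((size) => size >= k);
}

// Hypothetical anonymized rows; "*" stands for a fully generalized value (assumption).
const anonymizedRows = [
  { region: "*", gender: "Female", age: "45" },
  { region: "*", gender: "Female", age: "45" },
  { region: "*", gender: "Female", age: "45" },
  { region: "APJ", gender: "Male", age: "15" },
  { region: "APJ", gender: "Male", age: "15" },
  { region: "APJ", gender: "Male", age: "15" },
];

const quasiIdentifiers = ["region", "gender", "age"];
console.log(isKAnonymous(anonymizedRows, quasiIdentifiers, 3));
console.log(isKAnonymous(anonymizedRows, quasiIdentifiers, 4));
```

With two groups of three rows each, the data satisfies the property for k = 3 but not for k = 4, because no group contains four rows.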
Now that you have annotated the entity, the plugin takes care of dynamically anonymizing the entity using the inputs provided to the ‘@anonymize’ annotation.
You can optionally also pass in custom hierarchy functions with custom logic to anonymize entity columns.
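Conceptually, a hierarchy maps a value and a generalization level to a less specific value, either through a lookup table (the embedded form used in the annotations above) or through custom logic. The JavaScript sketch below only illustrates that idea and is not code that can be handed to the plugin; in SAP HANA Cloud a custom hierarchy function lives in the database. The helper names, the "*" fallback, and the 30-year bucket rule (inferred from the sample pairs) are assumptions.

```javascript
// Embedded hierarchy from the 'age' annotation: [original value, level-1 value].
const ageHierarchy = [["27", "15"], ["42", "45"], ["50", "45"], ["12", "15"]];

// Table-style generalization. Level 0 is the original value; levels beyond the
// listed ones fall back to "*" (assumed here to mean "fully suppressed").
function generalizeEmbedded(hierarchy, value, level) {
  const path = hierarchy.find(([original]) => original === value);
  if (!path) throw new Error(`No hierarchy entry for value "${value}"`);
  return level < path.length ? path[level] : "*";
}

// Function-style generalization: the same idea expressed as logic.
// Hypothetical rule: level 1 returns the midpoint of a 30-year bucket,
// which happens to reproduce the sample pairs above.
function generalizeAgeByRule(value, level) {
  if (level === 0) return value;
  if (level === 1) return String(Math.floor(Number(value) / 30) * 30 + 15);
  return "*";
}

console.log(generalizeEmbedded(ageHierarchy, "42", 1));
console.log(generalizeAgeByRule("42", 1));
```

A rule has the advantage that it covers values not listed in the table, while the table makes every mapping explicit and easy to review.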
Step 2: Consume the anonymized data of the entity:
Now that the entity has been anonymized, you can consume the anonymized data of the entity using a single method.
In the sample below, let’s explore how we can consume the anonymized employee data from the employee entity:
Firstly, we will need to define the service that exposes the anonymized employee entity, for example, in the srv/employee-service.cds file, as follows:
using { sap.cap as cap } from '../db/schema';
service EmployeeService @(path:'/browse') {
  entity Employee as projection on cap.Employee;
}
Next, in the business logic (for instance, in the srv/employee-service.js file), use the getAnonymizedData method of the “cap-llm-plugin” service to retrieve the anonymized data of the 'Employee' entity exposed by 'EmployeeService'.
Optionally, to retrieve specific records of the anonymized employee entity, you can pass the sequence id(s) as a second argument.
For example, the following call retrieves the record with id=1001:
// Connect to the plugin service, then fetch the anonymized record with sequence id 1001.
const anonymizer = await cds.connect.to("cap-llm-plugin");
const response = await anonymizer.getAnonymizedData("EmployeeService.Employee", [1001]);
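As a rough sketch of the next step, the anonymized rows can be turned into a prompt for the email scenario. The helper `buildEmailPrompt` and the prompt wording are hypothetical; the only plugin call used is getAnonymizedData as shown above, and the sketch assumes its result can be serialized as JSON. The service is passed in as a parameter so the helper does not depend on the CAP runtime.

```javascript
// `anonymizer` is expected to be the service obtained via
// cds.connect.to("cap-llm-plugin"); it is injected so this helper
// can be exercised without a running CAP application.
async function buildEmailPrompt(anonymizer, sequenceIds) {
  // Fetch only anonymized data; raw employee records never enter the prompt.
  const rows = await anonymizer.getAnonymizedData(
    "EmployeeService.Employee",
    sequenceIds
  );

  // Assumption: the result is JSON-serializable (for example, an array of row objects).
  return [
    "Write a short congratulatory email for each employee record below.",
    "The records are anonymized; do not try to infer identities.",
    JSON.stringify(rows, null, 2),
  ].join("\n");
}

// Possible usage inside an async CAP handler (not executed here):
//   const anonymizer = await cds.connect.to("cap-llm-plugin");
//   const prompt = await buildEmailPrompt(anonymizer, [1001]);
```

Keeping prompt construction in a small function like this makes it easy to verify that only the output of getAnonymizedData is ever sent to the LLM.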
Now that you have the anonymized employee data, you can feed it, with its crucial business context retained, to LLMs and obtain meaningful responses while preserving the confidentiality of the data.
Detailed documentation on the use of the CAP LLM plugin with samples can be found here.
Thanks a lot for reading this far. Hopefully you now have a good understanding of data anonymization and how to apply it with the CAP LLM Plugin in a CAP application. Please stay tuned: new features are coming soon, such as integration with SAP HANA Cloud’s Vector Engine and Joule (the natural language Gen AI copilot).
Please have a look at the discovery center mission regarding Retrieval Augmented Generation with GenAI on SAP BTP.
Many thanks to our colleagues for their support and collaboration in validating the plugin - Alper Dedeoglu, David Kunz, Steffen Weinstock, Kaweh Amoi-Taleghani. Thanks to our team members for their contributions – Liang Feng, Alex Bishka, Karishma Kapur, Sangeetha Krishnamoorthy, Weikun Liu. Special thanks to Sivakumar N and Anirban Majumdar for support and guidance.
If you have any questions, please contact us at paa@sap.com