Data security is becoming increasingly important, especially as applications and services move away from on-site installations and into the cloud. In the past, several data leaks have had a severe impact on the reputation of the affected companies. One approach to preventing such leaks is data encryption, which can be implemented at several layers of the I/O stack.
This blog post describes use cases for LUKS-based disk encryption in an SAP HANA environment by distinguishing it from other data encryption methods. Additionally, it provides a basic step-by-step tutorial for setting up a test scenario and briefly discusses the impact on the I/O performance of an IBM Power10 LPAR.
The layer of the I/O stack at which encryption should be implemented depends mainly on the security requirements and the level of trust that can be granted to the infrastructure. A common I/O stack for an IBM Power system is shown in the picture below.
Simplified I/O stack on IBM Power
All data passed down through the encryption layer is encrypted before it reaches the layers below, so encrypting at the lowest level, on the storage itself, may seem the obvious choice. This makes perfect sense as long as the layers above can be trusted while they process unencrypted data. That trust can be a security issue, especially in managed environments where these layers are operated by other companies.
The other extreme, encryption at the application level, ensures on the one hand that unencrypted data is only available within the application. On the other hand, all data that is not encrypted by the application, such as the application binaries themselves, the operating system, and other data stored on the server, remains unencrypted and therefore vulnerable. Nevertheless, in cases where all layers below the application are maintained by an external party, as with Software as a Service (SaaS) or similar hosting offerings, this is the only way to implement a trusted layer of data encryption.
The encryption capabilities of SAP HANA itself are covered in a separate blog post.
Disk encryption above or below the LVM layer provides a compromise in situations where full control over the LPAR is available, e.g. for self-operated machines as well as for Infrastructure as a Service (IaaS) offerings. It can be deployed automatically with little effort, is highly flexible, and has only a small impact on performance thanks to hardware-accelerated crypto routines.
The trust required from the infrastructure is minimal, as all data written to or read from disks outside the LPAR is stored encrypted. In contrast to encryption on storage systems, disk encryption can also be applied to local disks. Once implemented, it cannot be disabled accidentally, as can be the case with encryption methods at the application level. It is completely independent of file systems, as it is integrated into the device mapper. This makes it fully transparent to the layers above and preserves flexibility regarding the choice of file system.
Disk encryption with Linux Unified Key Setup (LUKS) uses dm-crypt and the Linux kernel Crypto API. This makes it platform-independent while still allowing platform-specific optimizations, e.g. utilizing hardware accelerators for crypto routines. The sum of these attributes allows the implementation of a homogeneous encryption concept, independent of file system, application, and storage infrastructure, as long as the LPAR itself is considered secure.
Setup of LUKS devices in a test environment
In preparation for creating LUKS encrypted devices the underlying devices have to be provisioned. This process is well documented in SAP Note 2055470 and the Storage guide (Chapter 10: "Setup of file systems for DATA and LOG") mentioned in the Note.
Setting up a LUKS device is straightforward. First of all, a secret is needed. Besides passphrases, LUKS supports different key formats. In a production environment the keys must not be stored next to the encrypted devices on the server, for obvious reasons. Both major enterprise distributions therefore provide means to implement a central key store and distribute the encryption keys to the clients at boot time. The documentation can be found here:
Nevertheless, this guide uses a local key file to keep things simple. An RSA private key for testing purposes can be created like this:

```shell
openssl genrsa -out <KEY_FILE>
```
Besides the key file, you need to know the full path to the VOLUME as well as the LUKS_DEVICE_NAME that will be assigned to the newly created LUKS device. To create a LUKS volume, it first has to be formatted:
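A minimal sketch of the format step, assuming a LUKS2 device (required for a non-default sector size) and the placeholders from the text:

```shell
# Format the raw volume as LUKS2, using the key file as the secret.
# --sector-size 4096 sets the 4 KiB encryption sector size (LUKS2 only).
cryptsetup luksFormat --type luks2 --sector-size 4096 \
    --key-file <KEY_FILE> <VOLUME>
```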
Internal tests showed that explicitly specifying a sector size of 4 KiB improved performance significantly (default: 512 bytes). Now that the volume is created, it has to be opened using the KEY_FILE and assigned a LUKS_DEVICE_NAME.
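The open step, sketched with the same placeholders:

```shell
# Open the LUKS volume and map it under /dev/mapper/<LUKS_DEVICE_NAME>
cryptsetup open --key-file <KEY_FILE> <VOLUME> <LUKS_DEVICE_NAME>
```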
Afterwards, LVM and the file system can be set up as usual. For example, a single volume group named vg_luks can be created, containing a single logical volume named lv_luks that uses all available disk space, striped over 8 disks with a stripe size of 128k:
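This LVM setup could look as follows for the variant with a LUKS device per disk; the mapped device names luks_dev1 … luks_dev8 are assumptions, not taken from the original setup:

```shell
# Register the eight opened LUKS devices as physical volumes
pvcreate /dev/mapper/luks_dev{1..8}

# Create the volume group vg_luks on top of them
vgcreate vg_luks /dev/mapper/luks_dev{1..8}

# One logical volume striped over all 8 PVs with a 128k stripe size,
# using all available space
lvcreate --name lv_luks --stripes 8 --stripesize 128k \
    --extents 100%FREE vg_luks
```

For the variant with LUKS on top of the logical volume, the order is reversed: LVM is set up on the raw disks first and luksFormat is run on the logical volume.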
Finally, the file system is created on the LUKS device.
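For example, assuming an XFS file system (as is common for SAP HANA) and the device name luks_device:

```shell
# Create the file system directly on the opened LUKS device
mkfs.xfs /dev/mapper/luks_device
```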
```shell
mkdir -p /mnt/luks
mount /dev/mapper/luks_device /mnt/luks
```
Basic I/O throughput tests
Encryption does not come for free. Therefore, a set of basic I/O tests was used to get an idea of the impact on I/O performance. The tests were chosen to represent common I/O operations of SAP HANA: log files are typically continuously overwritten sequentially with comparatively small block sizes, while data files are read and written randomly with commonly larger block sizes. Throughput is measured by a single-threaded program, which reduces limitations caused by competing requests. The tests were performed on an IBM Power E1050 LPAR with 8 dedicated cores.
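The original test program is not published; a comparable measurement could be sketched with fio (file name, size, and block sizes here are illustrative assumptions, not the original parameters):

```shell
# Log-like pattern: sequential writes, small blocks, queue depth 1
fio --name=log-write --filename=/mnt/luks/testfile --size=4g \
    --rw=write --bs=64k --iodepth=1 --numjobs=1 --direct=1

# Data-like pattern: random reads with a larger block size
fio --name=data-read --filename=/mnt/luks/testfile --size=4g \
    --rw=randread --bs=1m --iodepth=1 --numjobs=1 --direct=1
```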
The tests were executed on three different configurations:
An unencrypted device, used as the baseline
A single LUKS device created on top of the logical volume
A separate LUKS device created on each disk
The results of these measurements are shown in the following figure:
Basic I/O throughput tests
The first finding is that write throughput is higher on the encrypted setups than on the unencrypted one. This is caused by the internal parallelization of the encrypted write operations: since no concurrent workloads are running, all threads are available to the single process that runs the test. As a consequence, the write operations are investigated in more detail in the next measurements.
The second and more meaningful finding applies to the read performance. Here the numbers are comparable and show only a small impact of encryption on throughput, as long as the block size matches the striping configuration. The configuration of eight disks with a stripe size of 128k clearly favors larger block sizes. If the SAP application issues a higher share of smaller I/O operations, a setup with a smaller stripe size or fewer disks could be considered; it therefore pays off to monitor the I/O behavior of the SAP application on a regular basis.
The cryptsetup option `--perf-no_write_workqueue` can be used to disable the write workqueue and thereby make the throughput numbers comparable:
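For a LUKS2 device, one way to apply the flag is via `cryptsetup refresh`, sketched here with the placeholders used earlier:

```shell
# Re-activate the mapped device without the write workqueue;
# --persistent stores the flag in the LUKS2 header so it survives reboots
cryptsetup refresh --perf-no_write_workqueue --persistent \
    --key-file <KEY_FILE> <LUKS_DEVICE_NAME>
```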
Repeating the write tests with this option enabled affects the result significantly as shown in the following figure:
Basic I/O throughput tests with `--perf-no_write_workqueue`
The most noticeable observation here is the impact of the stripe configuration on the throughput (red box). Fully utilizing the stripe set per block (in other words using a block size of 128k times 8 disks) reduces the impact of encryption to a bare minimum.
The second most important information that can be gained from these numbers relates to the impact of parallelization on throughput. Even though the internal parallelization is disabled, the encryption is still parallelized across the LUKS devices in the stripe set. In this setup, the configuration with encryption at disk level uses eight LUKS devices, while the one with encryption on top of the logical volume uses only one. Despite the throughput being a little lower than with write workqueues enabled, it is already in the range of the unencrypted baseline measurements. Consequently, it can be assumed that, with regard to throughput, there is no need for a parallel setup of multiple LUKS devices.
Last but not least, the performance of the configuration with encryption on top of LVM needs to be discussed. It clearly shows that the encryption process generates additional load on the CPU. As shown in the first measurements, however, the impact on throughput is negligible due to the parallel execution of the encryption.
The CPU time actually used to process this workload can be approximated by collecting and comparing `perf` traces. To do so, traces of runs with 16 MB random writes are compared. RAM disks (the BRD kernel module) are used instead of volumes on a storage device to avoid any unintentional impacts or delays. The following figures show the top ten symbols with the counted CPU cycles for a configuration without encryption and with encryption at disk level.
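A trace of this kind could be collected as follows; the RAM-disk parameters and the workload command are placeholders, not the original test setup:

```shell
# Load the RAM-disk module (one device, size in KiB)
modprobe brd rd_nr=1 rd_size=4194304

# Record CPU cycles system-wide while the I/O test runs,
# then summarize the counted cycles per symbol
perf record -a -e cycles -- <IO_TEST_COMMAND>
perf report --sort symbol
```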
`perf` trace for 16MB BS random access not encrypted
`perf` trace for 16MB BS random access encrypted at disk level
The trace of the run without encryption shows that the majority of the time (13k cycles) is spent in `memcpy_power7`, which copies data to RAM. With encryption, three times as many cycles are counted for this symbol. This is plausible, as data has to be moved back and forth to be encrypted. Ignoring the `__ppc64_runlatch_off` symbol, which can be considered idle time, the total number of counted cycles for the encrypted setup is 30 times higher. Keep in mind that the additional overhead of writing to and waiting for the disk has been bypassed by using a RAM disk; this ratio will therefore be far smaller in real-world scenarios.
Performance of a generated SAP workload
Comparing isolated I/O operations does not take into account that I/O is usually only one of many parts of a workload. To simulate a real workload, the same disk configurations as in the previous tests were used for the SAP HANA data and log volumes. To generate load on the database, an increasing number of user interactions was generated, simulating transactional and analytical workloads. The corresponding dialog response times are displayed in the following figure.
Dialog response time comparison
This figure shows that for up to 60k users the response times of all three setups are nearly indistinguishable. Starting at around 70k users, the system is fully utilized and the additional load caused by disk encryption starts to become visible. The impact increases up to 90k users, where the system is heavily overloaded, with average response times already above 3 s. Even at this point, the response times of the encrypted setups are only around 8% higher than on the unencrypted setup. Transaction and query throughput at this point is approximately 2% lower, as shown in the following two figures.
Transaction throughput comparison
Query throughput comparison
Disk encryption with LUKS in an SAP HANA environment is applicable on IaaS or on self-operated machines where full and exclusive access to the LPAR is guaranteed. In other cases, only higher-level encryption features, like those provided by the application, can be implemented to effectively increase data security. In completely trusted environments with lower security requirements, encryption might be handled more efficiently on the storage itself.
In cases where it can be applied, disk encryption with LUKS:
is easy to set up and maintain
has only a minor impact on I/O and CPU performance
can provide disk encryption for the whole system
is highly flexible due to its platform independence
The measurements show that the internal parallelization handles the encryption workload effectively. Thus, it makes sense to implement disk encryption on top of the LVM layer, as long as security requirements allow the exposure of LVM metadata. This reduces the risk of over-parallelization and simplifies the setup.
Special thanks to Heinz Mauelshagen (Red Hat) and Rakshith Prakash (IBM) for contributing their knowledge and work to support these investigations.