CRM and CX Blog Posts by SAP
Stay up-to-date on the latest developments and product news about intelligent customer experience and CRM technologies through blog posts from SAP experts.
YannickRobin
Product and Topic Expert

Problem Statement

In Dynatrace, you may see regular container restarts associated with the error message "The container platform was OOMKilled".

OOMKilled.png

We have noticed that this issue can happen with any JVM service, including Solr.

Note that this error occurs at the container level and should not be confused with an out-of-memory error caused by JVM heap saturation.

This can be confirmed by observing the container memory: at the time of the restart, the memory usage of the container is very close to the container limit. The small gap is due to the container overhead.

memory usage.png

Analysis

For a deep-dive analysis of the root cause, you need to enable native memory tracking on the JVM.

 

ccv2.additional.catalina.opts=-XX:NativeMemoryTracking=detail -Xlog:nmt=trace,thread*=trace:file=VMlog.log:time,level,pid,tid,tags

 

Then you can raise a ticket with Support to collect data at regular intervals using the jcmd utility with the command jcmd 1 VM.native_memory.

By analysing the VMlog files, you can calculate the total native memory committed and understand which area of the non-heap memory keeps increasing until saturation.
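As a rough illustration of that calculation, the sketch below extracts the committed figures from a native memory summary. It assumes the input is the plain text printed by jcmd 1 VM.native_memory in the default KB scale, with a "Total:" line and one "- <area> (reserved=...KB, committed=...KB)" line per area. The function names are hypothetical and not part of any SAP or JDK tooling.

```shell
# Print the total committed native memory (in KB) from an NMT summary on stdin.
nmt_total_committed_kb() {
  awk '/^Total:/ { sub(/^.*committed=/, ""); sub(/KB.*$/, ""); print; exit }'
}

# Print "<committed KB> <area>" for each NMT area on stdin, largest first.
nmt_committed_by_area() {
  awk '
    /^-/ && /committed=/ {
      name = $0
      sub(/^-[ \t]*/, "", name)      # drop the leading "-" and padding
      sub(/[ \t]*\(.*$/, "", name)   # drop everything from the opening "("
      kb = $0
      sub(/^.*committed=/, "", kb)   # keep the number after "committed="
      sub(/KB.*$/, "", kb)
      print kb, name
    }
  ' | sort -rn
}

# Usage (nmt-sample.txt is a hypothetical file holding one jcmd output):
#   nmt_total_committed_kb < nmt-sample.txt
#   nmt_committed_by_area  < nmt-sample.txt
```

Running the second function on samples taken at different times makes it easier to see whether one area, such as Thread or Class, keeps growing.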

Native Memory.png

In most cases, we have found that the issue is not caused by non-heap memory growth but by a slow increase in memory fragmentation, until RSS memory reaches the container limit minus the container overhead.
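One way to see this pattern is to compare RSS with the committed memory reported by native memory tracking over time. The sketch below is only an approximation: it assumes you have samples of both values in KB, and it treats the difference as memory that native memory tracking does not account for, which includes allocator fragmentation but also other native allocations. The function names and the input layout are hypothetical.

```shell
# Print the resident set size in KB from a /proc/<pid>/status style file.
rss_kb_from_status() {
  awk '/^VmRSS:/ { print $2; exit }' "$1"
}

# Read lines of "<label> <rss_kb> <nmt_committed_kb>" on stdin and
# print "<label> <rss_kb - nmt_committed_kb>" for each sample.
rss_nmt_gap() {
  awk 'NF >= 3 { printf "%s %d\n", $1, $2 - $3 }'
}

# Usage (samples.txt is a hypothetical file with one sample per line):
#   rss_nmt_gap < samples.txt
```

A gap that keeps widening while the committed figure stays flat is consistent with the fragmentation pattern described above, although it does not prove it.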

RSS Memory.png

A common pattern behind the increase in memory fragmentation seems to be a high number of thread creations and terminations. The high throughput of thread stack allocation and deallocation creates many small chunks of free memory over time and increases RSS memory fragmentation.

Set Malloc Arena Max to reduce memory fragmentation

The GNU C library's (glibc's) malloc library contains a handful of functions that manage allocated memory in the application's address space.

malloc.png

This malloc is a "heap" style malloc, which means that chunks of various sizes exist within a larger region of memory (a "heap"). glibc's malloc allows for multiple heaps, each of which grows within its address space.

An arena is a structure that is shared among one or more threads which contains references to one or more heaps, as well as linked lists of chunks within those heaps which are "free". Threads assigned to each arena will allocate memory from that arena's free lists.

The workaround to reduce memory fragmentation is to reduce the number of malloc arenas.

Our benchmark with a vanilla SAP Commerce Cloud storefront stressed at very high load shows that performance is not impacted by reducing the number of arenas.

Below is the outcome for 10 cores (default 80 malloc arenas).

Malloc size vs fragmentation.png

On a 64-bit machine, the default is 8 arenas per core assigned to the service, and we suggest reducing this to 2 arenas per core.

To do this, set MALLOC_ARENA_MAX as a new environment variable for the affected container in CCV2 using the SAP Commerce Cloud Portal.

Please note that this is not a JVM parameter but a new environment variable, so this change requires customizing the Catalina startup script.
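The arithmetic can be sketched as follows for any core count. The function names are hypothetical, and the factors 8 and 2 are simply the default and the suggested values stated above.

```shell
# Default number of malloc arenas on a 64-bit machine: 8 per core.
default_arena_max() {
  echo $(( 8 * $1 ))
}

# Suggested value for MALLOC_ARENA_MAX: 2 per core.
suggested_arena_max() {
  echo $(( 2 * $1 ))
}

# Usage for a 10-core service:
#   default_arena_max 10     # 80
#   suggested_arena_max 10   # 20
```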

Below is the procedure:

  • Override /opt/tomcat/bin/catalina.sh by configuring the service properties below

 

ccv2.file.override.70.content=<BASE64_CONTENT_CUSTOM_CATALINA>
ccv2.file.override.70.path=/opt/tomcat/bin/catalina.sh
ccv2.file.override.71.content=<BASE64_CONTENT_OF_ORIGINAL_FILE>
ccv2.file.override.71.path=/opt/tomcat/bin/catalina-original.sh

 

<BASE64_CONTENT_CUSTOM_CATALINA> is the content below encoded to Base64 format.

 

#!/usr/bin/env bash
set -e
echo "Set MALLOC ARENA MAX to 20"
export MALLOC_ARENA_MAX=20
chown hybris /opt/tomcat/bin/catalina-original.sh
chmod +x /opt/tomcat/bin/catalina-original.sh
exec /opt/tomcat/bin/catalina-original.sh run
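To produce the Base64 value for the property, the script can be encoded locally. The sketch below assumes the standard base64 command-line utility is available and that the script above was saved as catalina-custom.sh (a hypothetical file name). Line breaks are stripped from the output on the assumption that the property value must be a single line.

```shell
# Print the content of the file given as $1 as single-line Base64.
encode_for_override() {
  base64 < "$1" | tr -d '\n'
}

# Usage:
#   encode_for_override catalina-custom.sh
```

The same function can be used for the original catalina.sh once its content has been retrieved.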

 

On a 64-bit machine, the default value of MALLOC_ARENA_MAX is 8x the number of cores.

To reduce memory fragmentation, we suggest setting it to 2x the number of cores of the service pod (for example, the number of cores of the Storefront service). The script above sets MALLOC_ARENA_MAX to 20 for a 10-core pod.

<BASE64_CONTENT_OF_ORIGINAL_FILE> is the original /opt/tomcat/bin/catalina.sh encoded to Base64 format.

Note that after a Tomcat upgrade, you need to encode this value again. You can execute the following script in HAC after the upgrade to check whether the content of the file has changed:

 

import org.apache.commons.io.FileUtils;
String contents = FileUtils.readFileToString(new File("/opt/tomcat/bin/catalina.sh"), "UTF-8")

 

Below is an example of properties with catalina-original.sh of SAP Commerce 2211.27 (Tomcat 9.0.86).

 

ccv2.file.override.70.content=IyEvdXNyL2Jpbi9lbnYgYmFzaApzZXQgLWUKZWNobyAiU2V0IE1BTExPQyBBUkVOQSBNQVggdG8gMjAiCmV4cG9ydCBNQUxMT0NfQVJFTkFfTUFYPTIwCmNob3duIGh5YnJpcyAvb3B0L3RvbWNhdC9iaW4vY2F0YWxpbmEtb3JpZ2luYWwuc2gKY2htb2QgK3ggL29wdC90b21jYXQvYmluL2NhdGFsaW5hLW9yaWdpbmFsLnNoCmV4ZWMgL29wdC90b21jYXQvYmluL2NhdGFsaW5hLW9yaWdpbmFsLnNoIHJ1bg==
ccv2.file.override.70.path=/opt/tomcat/bin/catalina.sh
ccv2.file.override.71.content=
ccv2.file.override.71.path=/opt/tomcat/bin/catalina-original.sh

 

  • After changing the service configuration, restart the service for the change to take effect
  • To confirm that the environment variable has been set, execute the following Groovy script in HAC on the relevant service.

 

def envVars = System.getenv()
envVars.each { key, value ->
    if(key.equals("MALLOC_ARENA_MAX"))
    {
        println("${key} = ${value}")
    }
}

 
