CRM and CX Blogs by SAP
Stay up-to-date on the latest developments and product news about intelligent customer experience and CRM technologies through blog posts from SAP experts.
YannickRobin
Product and Topic Expert

Problem Statement

In Dynatrace, you may see regular container restarts associated with the error message "The container platform was OOMKilled".

OOMKilled.png

We have noticed that this issue may happen with any JVM service, including Solr.

Please note that the error occurs at the container level and should not be confused with an out-of-memory error caused by JVM heap saturation.

It can be confirmed by observing the container memory. We see that at the time of the restart, the memory usage of the container is very close to the container limit. The small gap is due to the container overhead.

memory usage.png

Analysis

For a deep-dive analysis of the root cause, you need to enable native memory tracking on the JVM.

 

ccv2.additional.catalina.opts=-XX:NativeMemoryTracking=detail -Xlog:nmt=trace,thread*=trace:file=VMlog.log:time,level,pid,tid,tags

 

You can then raise a ticket with Support to collect data at regular intervals using the jcmd utility, with the command jcmd 1 VM.native_memory.

By analysing the VMlog files, you can calculate the total committed native memory and see which area of the non-heap memory keeps increasing until saturation.

Native Memory.png
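
As a minimal sketch of the calculation above, the total committed native memory can be pulled out of an NMT summary with a small shell helper. It assumes the "Total: reserved=...KB, committed=...KB" line that jcmd prints for VM.native_memory summary with the default KB scale; the function name nmt_committed_kb and the sample numbers are made up for illustration and are not measurements from this post.

```shell
# Extract the total committed native memory (in KB) from an NMT summary on stdin.
# Assumes the "Total: reserved=...KB, committed=...KB" line of the default KB scale.
nmt_committed_kb() {
  sed -n 's/^Total: reserved=[0-9]*KB, committed=\([0-9]*\)KB.*/\1/p' | head -n 1
}

# Illustrative input only (made-up numbers).
sample='Native Memory Tracking:

Total: reserved=5000000KB, committed=3000000KB
-                 Java Heap (reserved=2097152KB, committed=2097152KB)'

printf '%s\n' "$sample" | nmt_committed_kb
```

If shell access to the container is available, the same helper could be fed directly, for example with jcmd 1 VM.native_memory summary | nmt_committed_kb, and the value logged at regular intervals to see the trend.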

In most cases, we have found that the issue is not caused by growing non-heap memory but by a slow increase in memory fragmentation, until RSS memory reaches the container limit minus the container overhead.

RSS Memory.png
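
One way to observe this pattern, sketched here under assumptions, is to record the process RSS next to the NMT committed total over time: RSS that keeps growing while the committed total stays flat is consistent with the fragmentation described above. The helper names are hypothetical, the numbers are made up, and the gap is only a rough indicator because committed memory is not necessarily resident.

```shell
# Read the resident set size (VmRSS, in kB) from a /proc/<pid>/status style file.
rss_kb() {
  awk '/^VmRSS:/ { print $2 }' "$1"
}

# Gap between RSS and NMT committed memory, both in KB. Rough indicator only.
rss_minus_committed_kb() {
  echo $(( $1 - $2 ))
}

# Illustrative status file with made-up numbers (not measurements from this post).
status_file=$(mktemp)
printf 'Name:\tjava\nVmRSS:\t 3200000 kB\n' > "$status_file"

rss=$(rss_kb "$status_file")
echo "rss_kb=$rss gap_kb=$(rss_minus_committed_kb "$rss" 3000000)"
rm -f "$status_file"
```

On a running container, the input would be /proc/1/status, assuming the JVM runs as PID 1 as the jcmd 1 command implies.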

A common pattern behind the increase in memory fragmentation seems to be a high number of thread creations and terminations. The high throughput of thread stack allocation and deallocation creates many small chunks of free memory over time and increases RSS memory fragmentation.

Set Malloc Arena Max to reduce memory fragmentation

The GNU C library's (glibc's) malloc library contains a handful of functions that manage allocated memory in the application's address space.

malloc.png

This malloc is a "heap" style malloc, which means that chunks of various sizes exist within a larger region of memory (a "heap"). glibc's malloc allows for multiple heaps, each of which grows within its address space.

An arena is a structure that is shared among one or more threads which contains references to one or more heaps, as well as linked lists of chunks within those heaps which are "free". Threads assigned to each arena will allocate memory from that arena's free lists.

The workaround to reduce memory fragmentation is to reduce the number of malloc arenas.

Our benchmark with a vanilla SAP Commerce Cloud Storefront under very high load shows that performance is not impacted by reducing the number of arenas.

Below is the outcome for 10 cores (default 80 malloc arenas).

Malloc size vs fragmentation.png

On a 64-bit machine, the default is 8 arenas per core assigned to the service, and we suggest reducing this to 2 arenas per core.
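
The arithmetic can be sketched in a few lines of shell. The function names are hypothetical; the factors 8 and 2 are the default and the suggested value stated above.

```shell
# Default arena limit on a 64-bit machine: 8 arenas per core.
default_arena_max() { echo $(( 8 * $1 )); }

# Value suggested in this post: 2 arenas per core.
suggested_arena_max() { echo $(( 2 * $1 )); }

cores=10
echo "default=$(default_arena_max "$cores") suggested=$(suggested_arena_max "$cores")"
# prints: default=80 suggested=20
```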

To apply the reduced value, set MALLOC_ARENA_MAX as a new environment variable for the affected container in CCV2 using the SAP Commerce Cloud Portal.

Please note that this is not a JVM parameter but a new environment variable, so this change requires customizing the Catalina startup script.

Below is the procedure:

  • Override /opt/tomcat/bin/catalina.sh by configuring the service properties below

 

ccv2.file.override.70.content=<BASE64_CONTENT_CUSTOM_CATALINA>
ccv2.file.override.70.path=/opt/tomcat/bin/catalina.sh
ccv2.file.override.71.content=<BASE64_CONTENT_OF_ORIGINAL_FILE>
ccv2.file.override.71.path=/opt/tomcat/bin/catalina-original.sh

 

<BASE64_CONTENT_CUSTOM_CATALINA> is the content below encoded to Base64 format.

 

#!/usr/bin/env bash
set -e
echo "Set MALLOC ARENA MAX to 20"
export MALLOC_ARENA_MAX=20
chown hybris /opt/tomcat/bin/catalina-original.sh
chmod +x /opt/tomcat/bin/catalina-original.sh
exec /opt/tomcat/bin/catalina-original.sh run

 

On a 64-bit machine, the default arena limit (when MALLOC_ARENA_MAX is not set) is 8x the number of cores.

To reduce memory fragmentation, we suggest setting it to 2x the number of cores of the service pod (e.g. the number of cores of the Storefront service). The script above sets MALLOC_ARENA_MAX to 20 for a 10-core pod.

<BASE64_CONTENT_OF_ORIGINAL_FILE> is the original /opt/tomcat/bin/catalina.sh encoded to Base64 format.
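
As a rough sketch of the encoding step, both property values can be produced from a local copy of each file with the standard base64 utility. The helper names and the temporary file are assumptions for illustration; line breaks are stripped from the encoded output so the result fits on a single property line.

```shell
# Encode a file to a single-line Base64 string suitable for a property value.
encode_file() {
  base64 < "$1" | tr -d '\n'
}

# Decode a Base64 string back to text, to review it before applying.
decode_value() {
  printf '%s' "$1" | base64 --decode
}

# Illustrative round trip on a temporary file (shortened wrapper script).
tmp=$(mktemp)
printf '#!/usr/bin/env bash\nexport MALLOC_ARENA_MAX=20\n' > "$tmp"
encoded=$(encode_file "$tmp")
echo "ccv2.file.override.70.content=$encoded"
decode_value "$encoded"
rm -f "$tmp"
```

Decoding the value again before pasting it into the portal is a cheap check that nothing was truncated during copy and paste.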

Note that after a Tomcat upgrade, you need to encode this value again. To check whether the file has changed, you can execute the following script in HAC after the upgrade:

 

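import org.apache.commons.io.FileUtils;
// Read the current catalina.sh and print it Base64-encoded, for comparison with ccv2.file.override.71.content
String contents = FileUtils.readFileToString(new File("/opt/tomcat/bin/catalina.sh"), "UTF-8")
println Base64.getEncoder().encodeToString(contents.getBytes("UTF-8"))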

 

Below is an example of properties with catalina-original.sh of SAP Commerce 2211.27 (Tomcat 9.0.86).

 

ccv2.file.override.70.content=IyEvdXNyL2Jpbi9lbnYgYmFzaApzZXQgLWUKZWNobyAiU2V0IE1BTExPQyBBUkVOQSBNQVggdG8gMjAiCmV4cG9ydCBNQUxMT0NfQVJFTkFfTUFYPTIwCmNob3duIGh5YnJpcyAvb3B0L3RvbWNhdC9iaW4vY2F0YWxpbmEtb3JpZ2luYWwuc2gKY2htb2QgK3ggL29wdC90b21jYXQvYmluL2NhdGFsaW5hLW9yaWdpbmFsLnNoCmV4ZWMgL29wdC90b21jYXQvYmluL2NhdGFsaW5hLW9yaWdpbmFsLnNoIHJ1bg==
ccv2.file.override.70.path=/opt/tomcat/bin/catalina.sh
ccv2.file.override.71.content=
ccv2.file.override.71.path=/opt/tomcat/bin/catalina-original.sh

 

  • After changing the service configuration, the service must be restarted for the change to take effect.
  • To confirm that the environment variable has been set, you can execute the following Groovy script in HAC on the relevant service.

 

def envVars = System.getenv()
envVars.each { key, value ->
    if(key.equals("MALLOC_ARENA_MAX"))
    {
        println("${key} = ${value}")
    }
}

 

7 Comments
ductran
Explorer

Hi @YannickRobin,

Thanks for sharing. I went through the article and have 2 questions. It would be great if you could follow up, please:

- Given the example of "export MALLOC_ARENA_MAX=8", does it mean this node has 4 cores? I am confused because earlier in the article your environment is configured with 10 cores, which should result in "export MALLOC_ARENA_MAX=20". Can I confirm that "10 cores" and "export MALLOC_ARENA_MAX=8" are different environments?

- I understand that we are about to set MALLOC_ARENA_MAX by maintaining 2 catalina versions. Decoding the Base64 value from ccv2.file.override.71.content, I could see that the content is exactly identical to our catalina.sh, as read by the script below (run in HAC)

import org.apache.commons.io.FileUtils;
String contents = FileUtils.readFileToString(new File("/opt/tomcat/bin/catalina.sh"), "UTF-8")

Now my concern is: if we force CCv2 to maintain an overridden version of /opt/tomcat/bin/catalina-original.sh, does that mean we give up the benefit of catalina.sh being updated automatically when SAP releases a new update? Is there any other way to set that variable without touching the original version?

 

Thanks,

Duc.

YannickRobin
Product and Topic Expert

Hello Duc,

I've updated the post. I hope it answers your questions.

Yes, the content needs to be updated after a Tomcat upgrade.

Thanks,

Yannick

ductran
Explorer

Thanks @YannickRobin for the quick response. All my questions are now answered.

Duc.

laurent-malvert
Product and Topic Expert

Very interesting @YannickRobin ! 👏 Thanks for the write-up 🙂

Quick question (and forgive me if that's stupid, as I've been out of the Commerce loop and the CCv2 stack specifics for some time): as the issue seems to be caused by the default arena sizing of glibc's allocator, as you point out in the article, have other allocators been considered?

I know other cloud providers favor other allocators over glibc's. A few eons ago, tcmalloc was often mentioned as an alternative, or more recently mimalloc.

Maybe not something anyone can meddle with on CCv2 deployments, but maybe a point to carefully audit with the infrastructure team ?

YannickRobin
Product and Topic Expert

Indeed, the problem has been escalated to the Product team and they're evaluating various solutions to fix it. Using a malloc alternative is part of the discussion.

That being said, changing the number of malloc arenas seems easy to roll out globally with limited drawbacks.

ductran
Explorer

Hi Yannick,

Glad to know that the Product team is following up, thanks for the update.

In the meantime we will try to tune down the malloc arenas first. Please update us again if SAP comes up with a permanent fix.

Thanks,

Duc.

ductran
Explorer

Hi,

I have applied the bash script to our Commerce Cloud environment and it seems to be working well. I just want to share how to generate the configuration with fewer mistakes.

From HAC, run the script via Platform -> Scripting languages.

Before you run it, please be mindful of 2 things:

- This script should be run after you have upgraded Tomcat. As Yannick shared, the value has to be updated manually when the Tomcat version changes.

- If you want to generate the settings for a different application node (e.g. running this script on backoffice to produce the settings for backgroundprocessing), the number of cores might be different. In this case you can check the number of cores in Dynatrace, under the Infrastructure dashboard.

 

// recommendation from SAP
def MALLOC_ARENA_FACTORS = 2;
//If you would like to get number of cores automatically from impacted application nodes
def cores = Runtime.getRuntime().availableProcessors();
//If you want to generate for other nodes from any Hybris HAC
//def cores = 10;
def maxMallocArenas = cores * MALLOC_ARENA_FACTORS;

def catalinaOverridePlainText = new StringBuilder();
catalinaOverridePlainText.append("#!/usr/bin/env bash\n");
catalinaOverridePlainText.append("set -e\n");
catalinaOverridePlainText.append(String.format("echo \"Set MALLOC ARENA MAX to %d\"\n", maxMallocArenas));
catalinaOverridePlainText.append(String.format("export MALLOC_ARENA_MAX=%d\n", maxMallocArenas));
catalinaOverridePlainText.append("chown hybris /opt/tomcat/bin/catalina-original.sh\n");
catalinaOverridePlainText.append("chmod +x /opt/tomcat/bin/catalina-original.sh\n");
catalinaOverridePlainText.append("exec /opt/tomcat/bin/catalina-original.sh run\n");

//if you are going to run on local or on-prems it is under /bin/platform/resources/tomcat-xxx/bin
def originalCatalinaPlainText = org.apache.commons.io.FileUtils.readFileToString(new File("/opt/tomcat/bin/catalina.sh"), "UTF-8");

println "Your override MALLOC_ARENA_MAX script is going to be:";
println catalinaOverridePlainText.toString();
println "You might want to copy those lines onto ccv2 portal - under environment settings";
println "##############";
println String.format("ccv2.file.override.70.content=%s", Base64.getEncoder().encodeToString(catalinaOverridePlainText.toString().getBytes("UTF-8")))
println "ccv2.file.override.70.path=/opt/tomcat/bin/catalina.sh";
println String.format("ccv2.file.override.71.content=%s", Base64.getEncoder().encodeToString(originalCatalinaPlainText.toString().getBytes("UTF-8")))
println "ccv2.file.override.71.path=/opt/tomcat/bin/catalina-original.sh";
println "##############";

 

HAC might give you something like this:

ductran_0-1741313682866.png

Now you need to copy the content between the hash lines and apply it in the CCv2 portal. You might want to check the plain content again with https://www.base64decode.org.

Hope it helps,

Duc.