CRM and CX Blogs by SAP
Stay up-to-date on the latest developments and product news about intelligent customer experience and CRM technologies through blog posts from SAP experts.
YannickRobin
Product and Topic Expert


The database is a critical component for the scalability of SAP Commerce Cloud.

In this article, we will outline a methodology to reduce database utilization and enhance system scalability when the database is the bottleneck.

This methodology consists of five straightforward steps, each of which we will explain in detail:

  1. Monitor the Database Consumption: Assess the current database usage to establish a baseline before optimization.
  2. Identify Top Waits: Determine the primary causes of the contention to prioritize areas for improvement.
  3. Identify Top Queries: Pinpoint the most resource-intensive queries to target for optimization.
  4. Drill down to the Code Level: Examine the Java stack trace to identify the code to optimize.
  5. Optimize the Code: Reduce the number of executions or improve the top queries.

 

Monitor the Database Consumption

You can access DTU usage with Dynatrace. For this analysis, we will follow these steps:

  • Go to Observe and Explore > Dashboards > SAP – Infra Usage dashboard
  • Go to DB Usage - DTU graph and click Configure in data explorer

Dynatrace Configure in data explorer.png

  • Configure the timeframe

Dynatrace Resolution.png

  • Change the resolution to 1 min
  • Hide Used (max) metrics

Dynatrace Result.png

For the analysis, we use the Used (average) metric with a 1 min resolution. By default, the resolution may be coarser (e.g. 1 hour) and you may miss spikes when using the average, which is why we change it to 1 min. Please note that Dynatrace keeps only up to 15 days of history at this resolution.

The Used (max) metric can be very confusing, and we do not recommend using it for this analysis. It is normal for the database to utilize the maximum available hardware for short periods. When a spike lasts less than a minute, it is called a micro-spike, and it does not impact the system's overall performance. If you use a coarse resolution and rely on the max metric, even a single micro-spike per hour can make the graph appear alarming when there is no real concern.

The objective here is to drill down to the period where DTU usage is too high and identify the shortest time window to focus on in the next step.

Identify Top Waits

DTU (Database Transaction Unit) is a measure used by Azure SQL Database to describe the relative capacity of a database. It is an abstraction that encapsulates a combination of CPU, memory, reads, and writes. Here is a breakdown of the resources behind DTU:

  • CPU (Compute) - The amount of CPU resources available for processing queries and operations.

  • Memory - The amount of RAM allocated for data caching, query execution, and other in-memory processes.

  • Data IO - The throughput of read operations from disk, which affects the speed at which data can be read into memory.

  • Log IO - The throughput of write operations to the transaction log, which impacts the speed at which transactions can be committed.

DTUs provide a simplified way to measure and scale the performance of Azure SQL Databases, allowing you to choose a performance level without having to worry about individual resource allocation.
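
To make this abstraction a bit more concrete, the overall DTU percentage is roughly the highest of the CPU, Data IO and Log IO percentages at any point in time. The following sketch illustrates this approximation (it is our simplification for illustration, not an official formula) using the sys.dm_db_resource_stats DMV that we will query in the next step:

-- Sketch only: approximate the DTU percentage as the highest of the three
-- resource percentages reported by sys.dm_db_resource_stats (our simplification).
SELECT
    end_time,
    (SELECT MAX(v)
     FROM (VALUES (avg_cpu_percent),
                  (avg_data_io_percent),
                  (avg_log_write_percent)) AS resources(v)) AS approx_dtu_percent
FROM sys.dm_db_resource_stats
ORDER BY end_time DESC;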

When analyzing a potential bottleneck, it is crucial to identify which resource is limiting the system's performance.

In most cases, the resource contention will be on the CPU (usually for frontend traffic), but in some cases it could be Data IO or Log IO (usually due to jobs).

To identify the contention, we will follow these steps:

  • Connect to hAC
  • Go to Console > Flexible Search SQL query
  • Execute the following direct SQL query

 

 

SELECT
    end_time,
    avg_cpu_percent AS 'Average CPU Utilization In Percent',
    avg_data_io_percent AS 'Average Data IO In Percent',
    avg_log_write_percent AS 'Average Log Write I/O Throughput Utilization In Percent',
    avg_memory_usage_percent AS 'Average Memory Usage In Percent'
FROM sys.dm_db_resource_stats 
ORDER BY end_time ASC;

 

 

Below is an example of the result. From this, you can identify which resource is using the highest percentage of DTU.

Resource usage.png

You may want to copy the result to Excel and graph it for better visualization.

Note: The DMV (dm_db_resource_stats) captures data every 15 seconds with a retention period of one hour, so unfortunately you cannot analyze further back in the past. If you do not have the data, you can assume for the following step that your main contention is the CPU, as this is generally the case.
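
If you prefer a quick summary of which resource dominates over the retained hour instead of scanning the 15-second rows, you can aggregate the same DMV. This is only an illustrative variation of the query above:

-- Illustrative variation: average each resource over the whole retention
-- window (about one hour) to see which one dominates.
SELECT
    AVG(avg_cpu_percent)          AS avg_cpu_percent,
    AVG(avg_data_io_percent)      AS avg_data_io_percent,
    AVG(avg_log_write_percent)    AS avg_log_write_percent,
    AVG(avg_memory_usage_percent) AS avg_memory_usage_percent
FROM sys.dm_db_resource_stats;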

 

Identify Top Queries

To identify the top queries for a given period of time, we will use hAC to extract information from the Azure SQL Database Query Store.

For this investigation, follow these steps:

  • Connect to hAC
  • Go to Console > Flexible Search SQL query
  • Execute the following direct SQL query after replacing the start time and end time in UTC

 

 

SELECT TOP 10
    p.query_id AS query_id,
    qt.query_sql_text AS query_sql_text,
    ROUND(SUM(rs.count_executions * rs.avg_cpu_time / 1000.0), 2) AS total_cpu_millisec,
    ROUND(SUM(rs.avg_logical_io_reads), 2) AS total_logic_read_IO,
    ROUND(SUM(rs.avg_logical_io_writes), 2) AS total_logic_write_IO,
    ROUND(CONVERT(float, SUM(rs.avg_duration * rs.count_executions)) * 0.001, 2) AS total_duration,
    SUM(rs.count_executions) AS count_executions,
    MAX(ROUND(rs.max_cpu_time / 1000, 2)) AS max_cpu_time_ms,
    AVG(ROUND(rs.avg_duration / 1000, 2)) AS avg_duration_ms,
    COUNT(DISTINCT p.plan_id) AS num_plans
FROM sys.query_store_runtime_stats rs
    JOIN sys.query_store_plan p ON p.plan_id = rs.plan_id
    JOIN sys.query_store_query q ON q.query_id = p.query_id
    JOIN sys.query_store_query_text qt ON q.query_text_id = qt.query_text_id
WHERE NOT (rs.last_execution_time < '2024-07-01 4:00:00' OR rs.first_execution_time > '2024-07-01 5:00:00')
GROUP BY p.query_id, qt.query_sql_text, q.object_id
HAVING COUNT(DISTINCT p.plan_id) >= 1
ORDER BY total_cpu_millisec DESC

 

Please be very careful when running this query: the query store tables are large, and the query can hurt performance. In particular, do not select a time period that is too long (ideally 1 hour, at most 24 hours).

Here is an example of the result:

top_queries_per_cpu.png

In the example, we order by total_cpu_millisec, but you can also order by total_logic_read_IO or total_logic_write_IO depending on the contention you identified in the previous step.

Note 1: Query Store captures data every 60 minutes, with a retention period of 30 days (max 100 MB).

Note 2: Why not use the Top database statements metric provided by Dynatrace? Top database statements is quite useful for identifying the top queries that impact the total response time of the application. But for Dynatrace, the database is a black box, so it cannot know the top queries for the resource that creates the contention. We need an internal view to make sure we focus on the right queries to optimize.
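
Once you know the query_id of a top query, you may also want its full SQL text and execution plan from Query Store. The following is a hypothetical follow-up query (not part of the methodology above); replace 12345 with the query_id you identified:

-- Hypothetical follow-up: retrieve the full text and plan(s) for one query_id
-- identified by the previous query (replace 12345 with your query_id).
SELECT
    q.query_id,
    qt.query_sql_text,
    p.plan_id,
    CAST(p.query_plan AS xml) AS query_plan
FROM sys.query_store_query q
    JOIN sys.query_store_query_text qt ON q.query_text_id = qt.query_text_id
    JOIN sys.query_store_plan p ON p.query_id = q.query_id
WHERE q.query_id = 12345;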

 

Drill down to the Code Level

Now we're going to use Dynatrace to identify the code that is causing the execution of a particular query.

  • Go to Application Observability > Database Services
  • On the top level, select the time window of the analysis

YannickRobin_5-1719902683740.png

Note: In Dynatrace, it is normal to see two databases. This comes from the fact that there are two gateways and Dynatrace identifies them as two different databases, even though they have the same name. By selecting Top database statements, you group the executions of both databases together.

  • Click on Top database statements
  • Click on the Filter requests search field, select Database statement and search for your query

Screenshot 2024-07-02 at 3.25.23 PM.png

  • Here is an example of the result

YannickRobin_4-1719905277670.png

Note 1: It is normal to see the query duplicated, as Dynatrace identifies two databases.

Note 2: Dynatrace records only the most frequent queries, so you may not find your query if it is slow but executed only a few times.

  • Click Actions > ... > Service Backtrace

YannickRobin_2-1719904491010.png

  • Select the web application (ex: yacceleratorstorefront)
  • Check "Requests from" section to understand what are the HTTP requests calling this request (if require you can filter for a specific HTTP request)

YannickRobin_0-1719906193888.png

Note: The incoming requests on the left represent the number of HTTP requests, and the outgoing calls on the right represent the number of query executions.

  • Click on the Stacktrace tab and drill down to understand what code is calling this query

Screenshot 2024-07-02 at 3.32.41 PM.png

Note 1: This tab filters HTTP threads for the selected query. Unfortunately, as HTTP threads are part of a pool and are reused, it will include any stack traces that execute queries on these threads.

Note 2: The code-level data retention period is 10 days or a maximum of 100 GB of Dynatrace data storage in Production.

 

Conclusion

After identifying the top query and the corresponding code executing it, focus on optimizing the query or reducing the frequency of its execution. By systematically repeating this process for each query that significantly consumes resources, you will likely achieve a substantial reduction in your DTU usage.
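
As a starting point for the optimization itself, one common avenue (not covered in detail in this article) is to check whether Azure SQL suggests a missing index for the tables touched by your top queries. The sketch below uses the missing-index DMVs purely as an illustration; always review a suggestion carefully before creating an index:

-- Illustration only: list the missing-index suggestions with the highest
-- estimated impact. Validate them before creating any index.
SELECT TOP 10
    mid.statement AS table_name,
    mid.equality_columns,
    mid.inequality_columns,
    mid.included_columns,
    migs.user_seeks,
    migs.avg_user_impact
FROM sys.dm_db_missing_index_details mid
    JOIN sys.dm_db_missing_index_groups mig
        ON mig.index_handle = mid.index_handle
    JOIN sys.dm_db_missing_index_group_stats migs
        ON migs.group_handle = mig.index_group_handle
ORDER BY migs.user_seeks * migs.avg_user_impact DESC;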

6 Comments
YannickRobin
Product and Topic Expert

Here is how to identify the top wait categories from query store:

 

SELECT
    wait_category_desc,
    sum(total_query_wait_time_ms) as total_query_wait_time_ms
FROM
    sys.query_store_wait_stats qsws INNER JOIN
    sys.query_store_runtime_stats_interval qsrsi ON qsws.runtime_stats_interval_id = qsrsi.runtime_stats_interval_id
WHERE
    qsrsi.start_time >= '2024-07-01 4:00:00' AND qsrsi.end_time <= '2024-07-01 5:00:00'
GROUP BY wait_category_desc
ORDER BY total_query_wait_time_ms DESC

 

 

top_wait_category.png

The frequency is 60 min and the retention is 30 days.

David_A_C
Participant

Thanks for the tip. I would like to add that it is common to suffer alerts on the DTU of the Read Only Replica. You need to pay attention to the datasource (master/readonly) in hAC.

dgbeale
Newcomer

Great write up, Yannick, very useful for us.

I note that you ignore 'Used (max)' for the analysis, but do you know if this metric is used by SAP/Azure when calculating DTU usage?

 

 

YannickRobin
Product and Topic Expert

@David_A_C You have a good point. Unfortunately there is no query store on the replica, so it is not easy to investigate issues there.

@dgbeale Are you referring to the DTU consumption reported in SAP4Me to compare with the entitlement? If this is the case, the provisioned DTU is reported, not the DTU used.

gregorkofler
Explorer

Thank you Yannick for the valuable blog post.

We are struggling with DB deadlocks, and in Dynatrace we only see the number of DB deadlocks.
Unfortunately we can't determine which tables/queries are involved.

Is there any blog how to resolve DB deadlocks and find the root cause in the code?
Is there any SQL statement to find out the involved DB table of DB deadlocks?

Raktim_Pratihar
Explorer

Hi,

if you are using Azure SQL, then please enable the Azure SQL tab, which shows the data:

raktim_0-1734546169448.png