CRM and CX Blogs by SAP
Stay up-to-date on the latest developments and product news about intelligent customer experience and CRM technologies through blog posts from SAP experts.
YannickRobin
Product and Topic Expert


Database is a critical component for the scalability of SAP Commerce Cloud.

In this article, we will outline a methodology to reduce database utilization and enhance system scalability when the database is the bottleneck.

This methodology consists of five straightforward steps, each of which we will explain in detail:

  1. Monitor the Database Consumption: Assess the current database usage to establish a baseline before optimization.
  2. Identify Top Waits: Determine the primary causes of the contention to prioritize areas for improvement.
  3. Identify Top Queries: Pinpoint the most resource-intensive queries to target for optimization.
  4. Drill down to the Code Level: Examine the Java stack trace to identify the code to optimize.
  5. Optimize the Code: Reduce the number of executions or improve the top queries.

 

Monitor the Database Consumption

You can access DTU usage with Dynatrace. For this analysis, we will follow these steps:

  • Go to Observe and Explore > Dashboards > SAP – Infra Usage dashboard
  • Go to DB Usage - DTU graph and click Configure in data explorer

Dynatrace Configure in data explorer.png

  • Configure the timeframe

Dynatrace Resolution.png

  • Change the resolution to 1 min
  • Hide Used (max) metrics

Dynatrace Result.png

For the analysis, we use the Used (average) metric with a 1 min resolution. By default, the resolution may be coarser (e.g. 1 hour), and at that granularity the average can hide spikes; this is why we change it to 1 min. Please note that Dynatrace keeps only up to 15 days of history at this resolution.

The Used (max) metric can be very misleading, and we do not recommend relying on it for this analysis. It is normal for the database to utilize the maximum available hardware for short periods. When a spike lasts less than a minute, it is called a micro-spike, and it does not impact the system's overall performance. If you use a coarse resolution and rely on the max metric, even a single micro-spike per hour can make the graph appear alarming when there is no actual concern.

The objective here is to drill down into the period where DTU usage is too high and identify the shortest time window to focus on in the next step.

Identify Top Waits

DTU (Database Transaction Unit) is a measure used by Azure SQL Database to describe the relative capacity of a database. It is an abstraction that encapsulates a combination of CPU, memory, reads, and writes. Here is a breakdown of the resources behind DTU:

  • CPU (Compute) - The amount of CPU resources available for processing queries and operations.

  • Memory - The amount of RAM allocated for data caching, query execution, and other in-memory processes.

  • Data IO - The throughput of read operations from disk, which affects the speed at which data can be read into memory.

  • Log IO - The throughput of write operations to the transaction log, which impacts the speed at which transactions can be committed.

DTUs provide a simplified way to measure and scale the performance of Azure SQL Databases, allowing you to choose a performance level without having to worry about individual resource allocation.

When analyzing a potential bottleneck, it is crucial to identify which resource is limiting the system's performance.

In most cases, the resource contention will be on CPU (usually caused by frontend traffic), but in some cases it can be Data IO or Log IO (usually caused by jobs).

To identify the contention, we will follow these steps:

  • Connect to hAC
  • Go to Console > Flexible Search SQL query
  • Execute the following direct SQL query:

-- Resource usage is sampled every 15 seconds and retained for about one hour
SELECT
    end_time,
    avg_cpu_percent AS 'Average CPU Utilization In Percent',
    avg_data_io_percent AS 'Average Data IO In Percent',
    avg_log_write_percent AS 'Average Log Write I/O Throughput Utilization In Percent',
    avg_memory_usage_percent AS 'Average Memory Usage In Percent'
FROM sys.dm_db_resource_stats
ORDER BY end_time ASC;

Below is an example of the result. From it, you can identify the resource that is consuming the highest percentage of DTU.

Resource usage.png

You may want to copy the result to Excel and graph it for better visualization.

Note: The DMV (dm_db_resource_stats) captures data every 15 seconds with a retention period of one hour, so unfortunately you cannot analyze further back in the past with it. If the data is no longer available, you can generally assume for the following step that your main contention is CPU, as this is the most common case.

 

Identify Top Queries

To identify the top queries for a given period of time, we will use hAC to extract information from the Azure SQL Database Query Store.

For this investigation, follow these steps:

  • Connect to hAC
  • Go to Console > Flexible Search SQL query
  • Execute the following direct SQL query, after replacing the start and end times (in UTC):

-- Replace the two timestamps in the WHERE clause with your analysis window (UTC)
SELECT TOP 10
    p.query_id AS query_id,
    qt.query_sql_text AS query_sql_text,
    ROUND(SUM(rs.count_executions * rs.avg_cpu_time / 1000.0), 2) AS total_cpu_millisec,
    ROUND(SUM(rs.avg_logical_io_reads), 2) AS total_logic_read_IO,
    ROUND(SUM(rs.avg_logical_io_writes), 2) AS total_logic_write_IO,
    ROUND(CONVERT(float, SUM(rs.avg_duration * rs.count_executions)) * 0.001, 2) AS total_duration,
    SUM(rs.count_executions) AS count_executions,
    MAX(ROUND(rs.max_cpu_time / 1000, 2)) AS max_cpu_time_ms,
    AVG(ROUND(rs.avg_duration / 1000, 2)) AS avg_duration_ms,
    COUNT(DISTINCT p.plan_id) AS num_plans
FROM sys.query_store_runtime_stats rs
    JOIN sys.query_store_plan p ON p.plan_id = rs.plan_id
    JOIN sys.query_store_query q ON q.query_id = p.query_id
    JOIN sys.query_store_query_text qt ON q.query_text_id = qt.query_text_id
-- Keep only the runtime-stats intervals that overlap the analysis window
WHERE NOT (rs.last_execution_time < '2024-07-01 4:00:00' OR rs.first_execution_time > '2024-07-01 5:00:00')
GROUP BY p.query_id, qt.query_sql_text, q.object_id
HAVING COUNT(DISTINCT p.plan_id) >= 1
ORDER BY total_cpu_millisec DESC;

Please be very careful when running this query: the Query Store tables are large, and querying them can hurt performance. In particular, do not select too long a time period (ideally 1 hour, at most 24 hours).

Here is an example of the result:

top_queries_per_cpu.png

In the example, we order by total_cpu_millisec, but you can also order by total_logic_read_IO or total_logic_write_IO depending on the contention identified in the previous step.

Note 1: Query Store aggregates runtime statistics in 60-minute intervals, with a retention period of 30 days (max 100 MB).

Note 2: Why not use the Top database statements metric provided by Dynatrace? Top database statements is quite useful for identifying the top queries that impact the total response time of the application. But for Dynatrace the database is a black box, so it cannot know the top queries for the specific resource that creates the contention. We need an internal view to make sure we focus on the right queries to optimize.

 

Drill down to the Code Level

Now we're going to use Dynatrace to identify the code that is causing the execution of a particular query.

  • Go to Application Observability > Database Services
  • On the top level, select the time window of the analysis

YannickRobin_5-1719902683740.png

Note: In Dynatrace, it is normal to see two databases. This comes from the fact that there are two gateways, and Dynatrace identifies them as two different databases even though they have the same name. By selecting Top database statements, you group the executions of both databases together.

  • Click on Top database statements
  • Click on the Filter requests search field, select Database statement and search for your query

Screenshot 2024-07-02 at 3.25.23 PM.png

  • Here is an example of the result

YannickRobin_4-1719905277670.png

Note 1: It is normal to see the query duplicated, as Dynatrace identifies two databases

Note 2: Dynatrace records only the most frequent queries, so you may not find your query if it is slow but executed only a few times.

  • Click Actions > ... > Service Backtrace

YannickRobin_2-1719904491010.png

  • Select the web application (e.g. yacceleratorstorefront)
  • Check the "Requests from" section to understand which HTTP requests trigger this query (if required, you can filter for a specific HTTP request)

YannickRobin_0-1719906193888.png

Note: The incoming requests on the left represent the number of HTTP requests, and the outgoing calls on the right represent the number of query executions.

  • Click on the Stacktrace tab and drill down to understand which code is calling this query

Screenshot 2024-07-02 at 3.32.41 PM.png

Note 1: This tab filters HTTP threads for the selected query. Unfortunately, as HTTP threads are part of a pool and reused, it may also include stack traces of other queries executed on these threads.

Note 2: The code-level data retention period is 10 days, or a maximum of 100 GB of Dynatrace data storage, in Production.

 

Conclusion

After identifying the top query and the corresponding code executing it, focus on optimizing the query or reducing the frequency of its execution. By systematically repeating this process for each query that significantly consumes resources, you will likely achieve a substantial reduction in your DTU usage.
