
The database is a critical component for the scalability of SAP Commerce Cloud.
In this article, we will outline a methodology to reduce database utilization and enhance system scalability when the database is the bottleneck.
This methodology consists of five straightforward steps, each of which we will explain in detail:
1. Identify the period of time when DTU usage is too high
2. Identify the resource behind the contention (CPU, Data IO, or Log IO)
3. Identify the top queries for that period
4. Identify the code that executes those queries
5. Optimize the query or reduce the frequency of its execution
You can access DTU usage with Dynatrace. For this analysis, configure the DTU chart as described below.
For the analysis, use the Used (average) metric with a 1-minute resolution. By default, the resolution may be coarser (e.g., 1 hour), and averaging over such a long interval can hide spikes; this is why we switch to 1 minute. Please note that Dynatrace keeps only up to 15 days of history at this resolution.
The Used (max) metric can be very misleading, and we do not recommend it for this analysis. It is normal for the database to use the maximum available hardware for short periods. A spike that lasts less than a minute is called a micro-spike, and it does not impact the system's overall performance. If you use a long resolution and rely on the max metric, even a single micro-spike per hour can make the graph look alarming when there is no real concern.
The objective here is to drill down to the period where DTU usage is too high, in order to identify the shortest time window to focus on in the next step.
DTU (Database Transaction Unit) is a measure used by Azure SQL Database to describe the relative capacity of a database. It is an abstraction that encapsulates a combination of CPU, memory, reads, and writes. Here is a breakdown of the resources behind DTU:
CPU (Compute) - The amount of CPU resources available for processing queries and operations.
Memory - The amount of RAM allocated for data caching, query execution, and other in-memory processes.
Data IO - The throughput of read operations from disk, which affects the speed at which data can be read into memory.
Log IO - The throughput of write operations to the transaction log, which impacts the speed at which transactions can be committed.
DTUs provide a simplified way to measure and scale the performance of Azure SQL Databases, allowing you to choose a performance level without having to worry about individual resource allocation.
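In practice, the DTU percentage reported at any instant is driven by the dominant resource: roughly the maximum of the CPU, Data IO, and Log IO percentages. Below is a small T-SQL sketch that approximates the DTU percentage per 15-second interval, using the DMV we will query in the next step:

-- Approximate DTU percentage: the dominant resource drives the figure.
SELECT end_time,
       (SELECT MAX(v)
        FROM (VALUES (avg_cpu_percent),
                     (avg_data_io_percent),
                     (avg_log_write_percent)) AS resources(v)) AS avg_dtu_percent
FROM sys.dm_db_resource_stats
ORDER BY end_time DESC;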
When analyzing a potential bottleneck, it is crucial to identify which resource is limiting the system's performance.
In most cases, the contention will be on CPU (usually caused by frontend traffic), but in some cases it can be Data IO or Log IO (usually caused by jobs).
To identify the contention, run the following query against the database:
-- Per-resource utilization for the last hour (one row every 15 seconds).
SELECT
    end_time,
    avg_cpu_percent AS 'Average CPU Utilization In Percent',
    avg_data_io_percent AS 'Average Data IO In Percent',
    avg_log_write_percent AS 'Average Log Write I/O Throughput Utilization In Percent',
    avg_memory_usage_percent AS 'Average Memory Usage In Percent'
FROM sys.dm_db_resource_stats
ORDER BY end_time ASC;
Below is an example of the result. From it, you can identify which resource consumes the highest percentage of DTU.
You may want to copy the result to Excel and graph it for better visualization.
Note: The DMV (dm_db_resource_stats) captures data every 15 seconds with a retention period of one hour, so unfortunately you cannot analyze further back in time. If you no longer have the data, you can reasonably assume for the following step that your main contention is CPU, as this is generally the case.
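If one hour of history is not enough, a possible workaround is to snapshot the DMV into a regular table and run the INSERT at least once per hour. This is only a sketch: the table name is hypothetical, you must schedule the INSERT yourself, and your access level must allow DDL on the database.

-- Hypothetical history table for dm_db_resource_stats snapshots.
CREATE TABLE dbo.resource_stats_history (
    end_time                 datetime2 NOT NULL PRIMARY KEY,
    avg_cpu_percent          decimal(5,2),
    avg_data_io_percent      decimal(5,2),
    avg_log_write_percent    decimal(5,2),
    avg_memory_usage_percent decimal(5,2)
);

-- Append only the intervals not captured yet.
INSERT INTO dbo.resource_stats_history
SELECT s.end_time, s.avg_cpu_percent, s.avg_data_io_percent,
       s.avg_log_write_percent, s.avg_memory_usage_percent
FROM sys.dm_db_resource_stats s
WHERE NOT EXISTS (
    SELECT 1 FROM dbo.resource_stats_history h WHERE h.end_time = s.end_time
);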
To identify the top queries for a given period of time, we will use the hAC to extract information from the Azure SQL Query Store.
For this investigation, run the following query:
-- Top 10 Query Store statements by total CPU for the selected time window.
SELECT TOP 10
    p.query_id AS query_id,
    qt.query_sql_text AS query_sql_text,
    ROUND(SUM(rs.count_executions * rs.avg_cpu_time / 1000.0), 2) AS total_cpu_millisec,
    ROUND(SUM(rs.avg_logical_io_reads), 2) AS total_logic_read_IO,
    ROUND(SUM(rs.avg_logical_io_writes), 2) AS total_logic_write_IO,
    ROUND(CONVERT(float, SUM(rs.avg_duration * rs.count_executions)) * 0.001, 2) AS total_duration,
    SUM(rs.count_executions) AS count_executions,
    MAX(ROUND(rs.max_cpu_time / 1000, 2)) AS max_cpu_time_ms,
    AVG(ROUND(rs.avg_duration / 1000, 2)) AS avg_duration_ms,
    COUNT(DISTINCT p.plan_id) AS num_plans
FROM sys.query_store_runtime_stats rs
JOIN sys.query_store_plan p ON p.plan_id = rs.plan_id
JOIN sys.query_store_query q ON q.query_id = p.query_id
JOIN sys.query_store_query_text qt ON q.query_text_id = qt.query_text_id
-- Keep only runtime-stats intervals that overlap the analysis window.
WHERE NOT (rs.last_execution_time < '2024-07-01 04:00:00'
       OR rs.first_execution_time > '2024-07-01 05:00:00')
GROUP BY p.query_id, qt.query_sql_text, q.object_id
HAVING COUNT(DISTINCT p.plan_id) >= 1
ORDER BY total_cpu_millisec DESC;
Please be very careful when running this query: the Query Store tables are large, and querying them can hurt performance. In particular, do not select too long a time period (ideally 1 hour, at most 24 hours).
Here is an example of the result.
In the example, we order by total_cpu_millisec, but you can also order by total_logic_read_IO or total_logic_write_IO, depending on the contention you identified in the previous step, as shown below.
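For example, if the previous step showed that Data IO or Log IO is your contention, only the final line of the query changes:

-- When Data IO is the contention:
ORDER BY total_logic_read_IO DESC
-- When Log IO is the contention:
ORDER BY total_logic_write_IO DESC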
Note 1: The Query Store captures data every 60 minutes, with a retention period of 30 days (max 100 MB).
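These values are configurable per database. If you want to verify what your database actually uses, the Query Store exposes its configuration through a standard catalog view; a quick check, assuming you have read access:

-- Query Store configuration: state, capture interval, retention, size.
SELECT actual_state_desc,
       interval_length_minutes,
       stale_query_threshold_days,
       max_storage_size_mb,
       current_storage_size_mb
FROM sys.database_query_store_options;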
Note 2: Why not use the Top database statements metric provided by Dynatrace? Top database statements is quite useful for identifying the top queries that impact the total response time of the application. But for Dynatrace, the database is a black box: it cannot know the top queries for the specific resource that creates the contention. We need an internal view to make sure we focus on the right queries to optimize.
Now we are going to use Dynatrace to identify the code that causes a particular query to be executed.
Note: In Dynatrace, it is normal to see two databases. This comes from the fact that there are two gateways, and Dynatrace identifies them as two different databases even though they have the same name. By selecting Top database statements, you group the executions of both databases together.
Note 1: It is normal to see the query duplicated, as Dynatrace identifies two databases.
Note 2: Dynatrace records only the most frequent queries, so you may not find your query if it is slow but executed only a few times.
Note: The incoming requests on the left represent the number of HTTP requests, and the outgoing calls on the right represent the number of query executions.
Note 1: This tab filters HTTP threads for the selected query. Unfortunately, as HTTP threads are part of a pool and are reused, it will include any stack trace that executes queries on these threads.
Note 2: The code-level data retention period in Production is 10 days, or a maximum of 100 GB of Dynatrace data storage.
After identifying the top query and the corresponding code executing it, focus on optimizing the query or reducing the frequency of its execution. By systematically repeating this process for each query that significantly consumes resources, you will likely achieve a substantial reduction in your DTU usage.
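Query tuning itself is beyond the scope of this article, but as one possible starting point for the "optimize the query" branch, Azure SQL exposes missing-index suggestions through standard DMVs. A minimal sketch (treat the suggestions as hints to validate against your workload, not as directives):

-- Top missing-index suggestions, ranked by estimated impact.
SELECT TOP 10
    d.statement AS table_name,
    d.equality_columns,
    d.inequality_columns,
    d.included_columns,
    s.avg_user_impact,
    s.user_seeks + s.user_scans AS potential_uses
FROM sys.dm_db_missing_index_details d
JOIN sys.dm_db_missing_index_groups g ON g.index_handle = d.index_handle
JOIN sys.dm_db_missing_index_group_stats s ON s.group_handle = g.index_group_handle
ORDER BY s.avg_user_impact * (s.user_seeks + s.user_scans) DESC;

For reducing execution frequency, the right lever (caching, batching, or adjusting a cronjob schedule) depends on the code path you identified in the previous step.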