Background:
Since release 2110, SAP Data Intelligence Cloud provides an enhancement that lets you monitor the performance of your applications in Grafana. This post gives a more illustrative, hands-on overview than the one in the SAP Note.
Prerequisites:
This post focuses on the Grafana integration, so deploying your on-premises Grafana application and installing the vctl tool are out of scope. You will need the following information:
- <cluster address>: Address of the SAP Data Intelligence cluster running your tenant (same as used in the "vctl login" command).
- <username>: Tenant admin username (same as used in the "vctl login" command).
- <user password>: Tenant admin password (same as used in the "vctl login" command).
- <tenant name>: Tenant name (same as used in the "vctl login" command).
In my case, the cluster address is https://vsystem.ingress.dh-7rh7z7ok4.dh-canary.shoot.live.k8s-hana.ondemand.com, and the tenant name is default. You will also need admin privileges to fetch the tenant ID.
The "vctl tenant get" command returns the ID of your tenant, which you will need later in the PromQL queries. For more help on the SAP vctl tool, run "vctl --help" or refer to Commands - SAP Help Portal.
Steps:
- Log in to Grafana as an administrator. (For example, if you installed Grafana on your own laptop, the default URL of the Grafana UI is http://127.0.0.1:3000.)
- Select Configuration > Data Sources in the Grafana menu.
- Select Add data source.
- Select Time series databases > Prometheus
- Configure the SAP Data Intelligence Monitoring Query API as a Prometheus data source:
  - Name: "SAP Data Intelligence"
  - Default: "enabled"
  - URL: "https://<cluster address>/app/diagnostics-gateway/monitoring/query"
  - Access: "Server (default)"
  - Auth > Basic Auth: "enabled"
  - Auth > With Credentials: "enabled"
  - Basic Auth Details > User: "<tenant name>\<username>"
  - Basic Auth Details > Password: "<user password>"
  - Custom HTTP Headers > Header: "x-requested-with"
  - Custom HTTP Headers > Value: "fetch"
  - HTTP Method: "POST"
- Click Save & Test.
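If you want to check the values outside Grafana first, the settings above can be assembled into a plain HTTP request. The following is a minimal Python sketch, not an official client: the helper name `build_query_request` and the sample values are hypothetical, and the `/api/v1/query` suffix is an assumption based on the standard Prometheus HTTP API path that a Prometheus data source appends to its configured URL.

```python
import base64
import urllib.parse
import urllib.request

QUERY_PATH = "/app/diagnostics-gateway/monitoring/query"


def build_query_request(cluster_address, tenant, username, password, promql):
    """Build (but do not send) a POST request mirroring the data source settings."""
    # Accept the address with or without a scheme, e.g. "https://vsystem.example.com".
    host = cluster_address.split("://", 1)[-1].rstrip("/")
    # Assumption: standard Prometheus instant-query path below the data source URL.
    url = f"https://{host}{QUERY_PATH}/api/v1/query"
    # The Basic Auth user is "<tenant name>\<username>", as in the form above.
    credentials = f"{tenant}\\{username}:{password}".encode("utf-8")
    headers = {
        "Authorization": "Basic " + base64.b64encode(credentials).decode("ascii"),
        "x-requested-with": "fetch",
        "Content-Type": "application/x-www-form-urlencoded",
    }
    body = urllib.parse.urlencode({"query": promql}).encode("utf-8")
    return urllib.request.Request(url, data=body, headers=headers, method="POST")


# Placeholder values for illustration only.
request = build_query_request(
    "https://vsystem.example.com", "default", "admin", "secret", "sap_pod_status_ready"
)
print(request.get_method(), request.full_url)
```

Passing the request to `urllib.request.urlopen` would send it. The sketch stops before that step so that it can be run without access to a cluster.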
Good job! We now have a working data source for fetching the diagnostic metrics of SAP Data Intelligence Cloud, and we can move on to customizing a dashboard based on our business requirements.
What's more:
For your convenience, here are some PromQL queries that help you build such a dashboard quickly. When using the samples below:
- Replace ${SAP_DI_TENANT_UID} with the tenant ID that you fetched with the "vctl tenant get" command.
- Ensure that all special characters in the query, such as " and , are plain ASCII characters (copying from a web page can introduce typographic or full-width variants).
- Remove the leading SAP_DI_QUERY= and the outermost pair of quotation marks from each sample.
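These clean-up steps are easy to get wrong by hand, so they can also be scripted. The sketch below is only an illustration: the helper name `to_promql`, the set of characters it replaces, and the tenant ID `1234-abcd` are all hypothetical.

```python
import re

# Typographic or full-width characters that can sneak in when copying from a web page.
ASCII_FIXES = str.maketrans({"\u201c": '"', "\u201d": '"', "\uff02": '"', "\uff0c": ","})


def to_promql(sample, tenant_uid):
    """Turn one SAP_DI_QUERY="..." sample line into a plain PromQL expression."""
    text = sample.strip().translate(ASCII_FIXES)
    # Drop the SAP_DI_QUERY= prefix together with the outermost pair of quotation marks.
    match = re.fullmatch(r'SAP_DI_QUERY="(.*)"', text)
    if match:
        text = match.group(1)
    # Insert the tenant ID obtained from "vctl tenant get".
    return text.replace("${SAP_DI_TENANT_UID}", tenant_uid)


sample = (
    'SAP_DI_QUERY="sap_pod_status_ready{access_category=\u201cpod-performance\u201d,'
    'vsystem_datahub_sap_com_tenant_uid="${SAP_DI_TENANT_UID}"}"'
)
print(to_promql(sample, "1234-abcd"))
```

The printed expression can be pasted directly into the query field of a Grafana panel.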
-----------------------------------------------------------------------------------------
Basic Pod Performance Metrics Usage
Pod memory usage in bytes:
SAP_DI_QUERY="sap_pod_memory_working_set_bytes{access_category="pod-performance",vsystem_datahub_sap_com_tenant_uid="${SAP_DI_TENANT_UID}"}"
Pod CPU cores usage:
SAP_DI_QUERY="rate(sap_pod_cpu_user_seconds_total{access_category="pod-performance",vsystem_datahub_sap_com_tenant_uid="${SAP_DI_TENANT_UID}"}[5m])"
The `rate` function is applied because the base metric `sap_pod_cpu_user_seconds_total` records the total CPU usage over the lifetime of a pod and is not very informative on its own. For the expression above, a value of `0.1` corresponds to an average usage of `1/10th` of the CPU time of a single core over the past five minutes, while a value of `2` corresponds to an average usage of two full CPU cores.
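To make the arithmetic concrete, here is a simplified Python sketch of what `rate` computes on a counter. It is not the actual Prometheus implementation, which also handles counter resets and extrapolates to the window boundaries. The function name `simple_rate` and the sample values are hypothetical.

```python
def simple_rate(samples):
    """Average per-second increase of a counter between the first and last sample.

    Simplification: no counter-reset handling and no extrapolation,
    both of which the real rate() function performs.
    """
    (t_first, v_first), (t_last, v_last) = samples[0], samples[-1]
    return (v_last - v_first) / (t_last - t_first)


# Hypothetical (timestamp in seconds, cumulative CPU seconds) samples, one per minute.
cpu_seconds = [(0, 500.0), (60, 506.0), (120, 512.0), (180, 518.0), (240, 524.0)]

# 24 CPU seconds consumed in 240 seconds of wall time: one tenth of a core.
print(simple_rate(cpu_seconds))
```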
Pod network usage as bytes per second:
SAP_DI_QUERY="rate(sap_pod_network_bytes_total{access_category="pod-performance",vsystem_datahub_sap_com_tenant_uid="${SAP_DI_TENANT_UID}"}[5m])"
The `rate` function is applied because the base metric `sap_pod_network_bytes_total` records the total network usage over the lifetime of a pod and is not very informative on its own. The network usage as bytes per second is computed over the past five minutes.
Pod readiness status:
SAP_DI_QUERY="sap_pod_status_ready{access_category="pod-performance",vsystem_datahub_sap_com_tenant_uid="${SAP_DI_TENANT_UID}"}"
This is a zero-one metric. The value `1` represents a ready pod, the value `0` a non-ready pod.
Smoothed pod readiness status:
SAP_DI_QUERY="avg_over_time(sap_pod_status_ready{access_category="pod-performance",vsystem_datahub_sap_com_tenant_uid="${SAP_DI_TENANT_UID}"}[5m])"
The result is a sliding window average of the pod readiness status over the past five minutes (based on four or five samples due to a sample resolution of one minute). The values lie between zero (pod not ready for the past five minutes) and one (pod ready for the past five minutes). This metric is suitable to define an alert threshold for example at `0.7`, allowing for the pod to be not ready for a minute during restarts without raising an alert.
> Note: Do not set the time interval of the `rate` or `avg_over_time` functions in your PromQL expressions below `5m`. The queried metrics have a time series resolution of one minute at best (see "Metric resolution and retention" in the SAP Note). This means that, for an interval of five minutes, the `rate` or `avg_over_time` functions are already based on as few as four sample points. Reducing the interval further may leave too few samples to calculate these functions.
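The effect of the `0.7` threshold on the smoothed readiness status can be checked with a few lines of Python. This is only a sketch of the averaging idea, assuming an alert fires when the average drops below the threshold. The function names and sample windows are hypothetical.

```python
def smoothed_readiness(samples):
    """Plain average of 0/1 readiness samples, like avg_over_time over one window."""
    return sum(samples) / len(samples)


def should_alert(samples, threshold=0.7):
    """Assumed alert rule: fire when the smoothed readiness drops below the threshold."""
    return smoothed_readiness(samples) < threshold


short_restart = [1, 1, 0, 1, 1]   # not ready for one sample out of five
four_samples = [1, 0, 1, 1]       # same restart, but only four samples in the window
longer_outage = [1, 0, 0, 1, 1]   # not ready for two samples

for window in (short_restart, four_samples, longer_outage):
    print(window, smoothed_readiness(window), should_alert(window))
```

A single not-ready sample keeps the average at or above 0.75, so no alert is raised, while two not-ready samples push it below the threshold.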
-----------------------------------------------------------------------------------------
Tenant Pod Performance
Total memory usage in bytes of all pods of the tenant:
SAP_DI_QUERY="sum(sap_pod_memory_working_set_bytes{access_category="pod-performance",vsystem_datahub_sap_com_tenant_uid="${SAP_DI_TENANT_UID}"})"
Total CPU cores usage of all pods of the tenant:
SAP_DI_QUERY="sum(rate(sap_pod_cpu_user_seconds_total{access_category="pod-performance",vsystem_datahub_sap_com_tenant_uid="${SAP_DI_TENANT_UID}"}[5m]))"
Total network usage as bytes per second of all pods of the tenant:
SAP_DI_QUERY="sum(rate(sap_pod_network_bytes_total{access_category="pod-performance",vsystem_datahub_sap_com_tenant_uid="${SAP_DI_TENANT_UID}"}[5m]))"
Total pod count:
SAP_DI_QUERY="count(sap_pod_status_ready{access_category="pod-performance",vsystem_datahub_sap_com_tenant_uid="${SAP_DI_TENANT_UID}"})"
Total count of ready pods:
SAP_DI_QUERY="sum(sap_pod_status_ready{access_category="pod-performance",vsystem_datahub_sap_com_tenant_uid="${SAP_DI_TENANT_UID}"})"
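Because `sap_pod_status_ready` is a zero-one metric, `count` and `sum` answer different questions about the same series. The short Python sketch below mirrors this with hypothetical pod names and values.

```python
# Hypothetical readiness samples: one 0/1 value per pod.
pod_ready = {"pod-a": 1, "pod-b": 0, "pod-c": 1}

total_pods = len(pod_ready)            # what count(...) returns
ready_pods = sum(pod_ready.values())   # what sum(...) returns
not_ready_pods = total_pods - ready_pods

print(total_pods, ready_pods, not_ready_pods)
```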
-----------------------------------------------------------------------------------------
User Pod Performance
Total memory usage in bytes for each user:
SAP_DI_QUERY="sum(sap_pod_memory_working_set_bytes{access_category="pod-performance",vsystem_datahub_sap_com_tenant_uid="${SAP_DI_TENANT_UID}",vsystem_datahub_sap_com_user!=""}) by (vsystem_datahub_sap_com_user)"
Total CPU cores usage for each user:
SAP_DI_QUERY="sum(rate(sap_pod_cpu_user_seconds_total{access_category="pod-performance",vsystem_datahub_sap_com_tenant_uid="${SAP_DI_TENANT_UID}",vsystem_datahub_sap_com_user!=""}[5m])) by (vsystem_datahub_sap_com_user)"
Total network usage as bytes per second for each user:
SAP_DI_QUERY="sum(rate(sap_pod_network_bytes_total{access_category="pod-performance",vsystem_datahub_sap_com_tenant_uid="${SAP_DI_TENANT_UID}",vsystem_datahub_sap_com_user!=""}[5m])) by (vsystem_datahub_sap_com_user)"
Total pod count for each user:
SAP_DI_QUERY="count(sap_pod_status_ready{access_category="pod-performance",vsystem_datahub_sap_com_tenant_uid="${SAP_DI_TENANT_UID}",vsystem_datahub_sap_com_user!=""}) by (vsystem_datahub_sap_com_user)"
-----------------------------------------------------------------------------------------
Pipeline Graph Performance
Total memory usage in bytes of all pods for each (multi-pod) graph:
SAP_DI_QUERY="sum(sap_pod_memory_working_set_bytes{access_category="pod-performance",vsystem_datahub_sap_com_tenant_uid="${SAP_DI_TENANT_UID}",graph!=""}) by (graph)"
Total CPU cores usage of all pods for each (multi-pod) graph:
SAP_DI_QUERY="sum(rate(sap_pod_cpu_user_seconds_total{access_category="pod-performance",vsystem_datahub_sap_com_tenant_uid="${SAP_DI_TENANT_UID}",graph!=""}[5m])) by (graph)"
Total network usage as bytes per second for each (multi-pod) graph:
SAP_DI_QUERY="sum(rate(sap_pod_network_bytes_total{access_category="pod-performance",vsystem_datahub_sap_com_tenant_uid="${SAP_DI_TENANT_UID}",graph!=""}[5m])) by (graph)"
Readiness status for each (multi-pod) graph:
SAP_DI_QUERY="max(sap_pod_status_ready{access_category="pod-performance",vsystem_datahub_sap_com_tenant_uid="${SAP_DI_TENANT_UID}",graph!=""}) by (graph)"
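The `by (...)` clause groups the series by a label before aggregating. As a rough illustration in Python (hypothetical helper name, labels, and values), the readiness query above behaves like this:

```python
from collections import defaultdict


def aggregate_by(samples, label, func):
    """Group (labels, value) samples by one label and aggregate each group."""
    groups = defaultdict(list)
    for labels, value in samples:
        key = labels.get(label, "")
        if key != "":  # mirrors the graph!="" matcher in the queries above
            groups[key].append(value)
    return {key: func(values) for key, values in groups.items()}


# Hypothetical readiness samples; the last pod does not belong to any graph.
samples = [
    ({"pod": "p1", "graph": "g1"}, 1),
    ({"pod": "p2", "graph": "g1"}, 0),
    ({"pod": "p3", "graph": "g2"}, 0),
    ({"pod": "p4"}, 1),
]

print(aggregate_by(samples, "graph", max))
```

Note that with `max`, a graph is reported as ready as soon as at least one of its pods is ready. Passing `min` instead would require all pods of the graph to be ready.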
-----------------------------------------------------------------------------------------
Here is a screenshot of one query:
Once you have customized your dashboard, you will get a graphical overview of your SAP Data Intelligence Cloud.
Note that the Grafana UI can differ depending on the version of Grafana installed.
The suggested version (the one used for the screenshots in this article) is v7.5.14. With other versions, you may need to adjust the steps slightly to run the queries properly in the Explore panel.
Cheers! Now enjoy your journey in Grafana!
If anything is unclear, feel free to comment or reach out to me directly.