
This blog explains how to test a Windows Failover Cluster environment. We take a close look at the central services of an SAP system and at whether the cluster handles a failover correctly. Failover tests are mandatory before a high availability (HA) solution goes live in production. In addition, a failover should be tested at least every 6 months to verify that the cluster still works as expected.
Open SAP MMC and the Failover Cluster Manager in parallel to watch in real time what happens.
In SAP MMC select the ASCS instance and stop it.
After the instance is stopped, you should see a picture like this:
Start the instance in SAP MMC again.
Expected result:
Goal:
The communication between SAP MMC <- sapstartsrv -> SAPRC.DLL works without communication or authorization problems.
Open SAP MMC and the Failover Cluster Manager in parallel to watch in real time what happens.
Check the last line in this ERS trace file on ALL cluster nodes (where ERS instances are running):
\\<hostname>\saploc\<SID>\ERSxx\work\dev_enrepha
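Instead of opening the file on every node by hand, the last line can be read from one place. This is a minimal sketch, assuming the saploc share is reachable from the machine where you run it. The function name Get-TraceLastLine and the node names in the example call are illustrative, and <SID> and ERSxx stay placeholders that you must replace with your own values.

```powershell
# Hypothetical helper (not part of any SAP tooling): returns the last line of a
# trace file, or $null if the file cannot be found.
function Get-TraceLastLine {
    param(
        [Parameter(Mandatory)]
        [string]$Path
    )
    if (-not (Test-Path -LiteralPath $Path)) {
        return $null
    }
    Get-Content -LiteralPath $Path -Tail 1
}

# Example call; node names, <SID> and ERSxx are placeholders for your landscape:
# foreach ($node in 'node1', 'node2') {
#     $line = Get-TraceLastLine -Path "\\$node\saploc\<SID>\ERSxx\work\dev_enrepha"
#     '{0}: {1}' -f $node, $line
# }
```

Running the same call before and after a failover makes it easy to compare the two states of the trace file.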
You should see at this point in time:
Use the Failover Cluster Manager tool to move the SAP cluster group to another node.
These times are the baseline (“default”) for your cluster setup. An unplanned failover should show the same, or at least similar, times!
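If you prefer a number over reading the times off the screen, the planned move can also be triggered and timed from PowerShell. This is only a sketch under assumptions: the group name 'SAP <SID>' and the target node are placeholders, and the FailoverClusters module must be available on the machine where you run it.

```powershell
# Requires the FailoverClusters module (part of the Failover Clustering tools).
Import-Module FailoverClusters

# Placeholders: replace with your SAP cluster group and the target node.
$groupName  = 'SAP <SID>'
$targetNode = '<target-node>'

# Move-ClusterGroup waits for completion unless -Wait is given,
# so the measured time covers taking the group offline, moving it,
# and bringing it online again.
$elapsed = Measure-Command {
    Move-ClusterGroup -Name $groupName -Node $targetNode | Out-Null
}
'Planned failover took {0:N1} seconds' -f $elapsed.TotalSeconds

# Show where the group's resources ended up and in which state.
Get-ClusterGroup -Name $groupName | Get-ClusterResource |
    Format-Table Name, State, OwnerNode
```

Keep the measured value; it is the reference for the unplanned failover tests.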
After the failover has completed, check the ERS trace files again.
You should see after failover:
You can also check the dev_enqsrv trace file. The Enqueue server should have detected the shadow replication table of the formerly active ERS and should use this table to restore the locks of the SAP system.
Open SAP MMC and the Failover Cluster Manager in parallel to watch in real time what happens.
Open Windows Task Manager, go to the “Details” tab and select the SAP Message server process (msg_server.exe) or the SAP Enqueue server process (enserver.exe) from the list of processes.
Right-click the process and select “End Task”. This kills (= terminates) the process.
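The same step can be done from an elevated PowerShell session on the node that owns the SAP cluster group. The sketch below assumes that exactly one process with the chosen name runs on the host; with several SAP systems on one node you would have to pick the right process ID yourself.

```powershell
# Process names are the image names from the text above, without ".exe".
$processName = 'msg_server'   # or 'enserver' for the Enqueue server

# List the matching processes first so you can see what would be terminated.
$procs = @(Get-Process -Name $processName -ErrorAction Stop)
$procs | Format-Table Id, ProcessName, StartTime

if ($procs.Count -eq 1) {
    # Terminate without a graceful shutdown, comparable to "End Task".
    Stop-Process -Id $procs[0].Id -Force
}
else {
    Write-Warning "Found $($procs.Count) processes named $processName; stop the right one by Id."
}
```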
The Failover Cluster Manager should detect this within a few seconds (about 3 seconds at most). The cluster should take all resources of the SAP cluster group offline and move the group to another cluster node. If you have configured “Possible Owners” so that the group switches to a specific cluster node, the cluster moves the group to that node (relevant in scenarios with three or more cluster nodes).
Check how long it takes to bring the resources online:
Detailed information can be found in the cluster.log, which you can generate in PowerShell (with admin rights):
Get-ClusterLog -Destination C:\temp -UseLocalTime -TimeSpan 30
(This generates the cluster logs for all cluster nodes with the events of the last 30 minutes.)
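The generated files can be large, so a first pass that keeps only error and warning lines is often useful. This is a sketch under assumptions: it expects one file per node named like <node>_cluster.log in the destination folder, and it relies on the severity appearing as a separate ERR or WARN token on each line. Verify both against your own files. The function name Get-ClusterLogProblem is made up for this example.

```powershell
# Hypothetical helper: returns the lines of a cluster log that carry the
# severity token ERR or WARN (assumed log format, check your own files).
function Get-ClusterLogProblem {
    param(
        [Parameter(Mandatory)]
        [string]$Path
    )
    Select-String -Path $Path -Pattern '\s(ERR|WARN)\s' -CaseSensitive |
        ForEach-Object { $_.Line }
}

# Example call against the files written by Get-ClusterLog:
# Get-ClusterLog -Destination C:\temp -UseLocalTime -TimeSpan 30
# Get-ClusterLogProblem -Path 'C:\temp\*_cluster.log'
```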
Goals:
Open SAP MMC and the Failover Cluster Manager in parallel to watch in real time what happens.
If you have a dedicated heartbeat network connection between the cluster nodes, this test shows whether the cluster remains stable when the heartbeat network packets can only travel through the public network interface.
On the current cluster node where the SAP cluster group is running, disable the heartbeat network interface in Windows.
Expected result:
Nothing should happen. Check the cluster logs; you should see related network / heartbeat errors on all cluster nodes. SAP operations must not be affected!
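The heartbeat test can also be driven from an elevated PowerShell session on the node that owns the SAP cluster group. This is a sketch only: the adapter name 'Heartbeat' is a placeholder, so look up the real name with Get-NetAdapter first and make sure you do not disable the interface your own session depends on.

```powershell
Import-Module FailoverClusters

# Which networks does the cluster know, and what role do they have?
Get-ClusterNetwork | Format-Table Name, Role, State
Get-ClusterNetworkInterface

# Placeholder: replace with the name of your heartbeat adapter (see Get-NetAdapter).
$adapterName = 'Heartbeat'

# Disable the heartbeat interface on this node.
Disable-NetAdapter -Name $adapterName -Confirm:$false

# Observe the cluster and SAP MMC, then check how the cluster sees its networks.
Get-ClusterNetwork | Format-Table Name, Role, State

# Re-enable the interface once the test is finished.
Enable-NetAdapter -Name $adapterName
```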
Same as the previous test, but this time disable all network cards on the current cluster node!
Warning!
If you are connected to the Windows host via RDP, you will lose the connection and will not be able to reconnect. Make sure you use a hardware console (Remote Management Board) or, if Windows runs in a VM, a console of the hypervisor. If you cannot connect to Windows via a console, you cannot carry out this test. On physical hardware, you can alternatively unplug the cables from the network interface(s) of the server.
Expected result:
The other cluster node(s) should detect the loss of communication to this host. Depending on the quorum model you use, another cluster node should take over the SAP cluster group and continue operations.
Goal:
Download the tool “notmyfault.exe” from https://learn.microsoft.com/en-us/sysinternals/downloads/notmyfault .
This tool will initiate a crash (= Blue Screen or Stop) of the Windows OS.
Start the tool with the parameter “crash” on the cluster node where the SAP cluster group is running. After you start the tool, the RDP session hangs because the connection is lost while the OS is stopping (= crashing).
Watch closely what happens on the other cluster node! It should take over the SAP cluster group within seconds, NOT minutes!
Goal: