Introduction
This document describes high availability for SAP HANA using Pacemaker on Linux, focusing on a two‑site, active/passive database design with system replication and automatic failover. It explains how Pacemaker, Corosync, fencing mechanisms, and HANA system replication work together to provide fast recovery, protect against split‑brain, and ensure that failover only occurs when data is safely in sync.
Pacemaker’s role is to automate safe failover. It continuously monitors node health, HANA process health, and the replication state between sites. If the primary becomes unavailable, the cluster can promote the standby to primary and move the client access endpoint (typically a virtual IP or load balancer address) so applications reconnect without configuration changes. To avoid the most dangerous failure mode, dual-primary or “split brain,” the design depends on reliable fencing (STONITH) and a quorum strategy that ensures the cluster only makes promotion decisions when it can guarantee that the old primary is truly stopped.
Content Overview
- Introduction
- Content Overview
- Key Terms
- SAP ECS (RISE) reference architecture
- HANA log shipping and log replay
- Failover Scenarios
- Summary
- Documentation
Key Terms
Fencing (STONITH)
Fencing means forcibly removing a node from service so it cannot access shared resources or continue serving clients (classic “Shoot The Other Node In The Head”). In HANA SR clusters, fencing is primarily about preventing two primaries (data corruption risk).
Common fencing methods:
- SBD/watchdog (very common on SLES for SAP): watchdog-triggered self-fence, optionally backed by shared/diskless SBD. SUSE explicitly stresses watchdog usage for SBD reliability.
- IPMI/iLO, cloud fence agents, etc. SUSE provides IPMI examples.
Promotion
Pacemaker models “who is primary” as a promotable resource (historically “master/slave”). Promotion is the action: secondary is promoted to primary (takeover).
Failover
Failover moves the primary database role, and with it the client access endpoint, to the other node. Clients (ABAP, tools, interfaces) connect to a VIP (or a cloud load balancer fronting an IP). Pacemaker moves the VIP to the node that currently hosts the primary HANA, so applications do not need to change connection endpoints. SUSE’s guide shows adding the VIP as an IPaddr2 resource and notes extra settings in public cloud.
Quorum Design
Quorum is “can this cluster make decisions safely.” Two-node clusters are special:
- Many setups use two_node=1 and a fencing-first policy: you can run with 2 nodes, but loss of communication must result in fencing to avoid split brain. SUSE shows the corosync votequorum two-node configuration.
- Some environments add a qdevice / “majority maker” in a third location (common for strict quorum semantics), especially for stretched scenarios.
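The vote arithmetic behind the two-node special case can be sketched as follows. This is a minimal illustration, assuming the usual majority rule (strictly more than half of the expected votes) and modelling two_node=1 as "one vote is enough when exactly two votes are expected". The function name `has_quorum` is hypothetical and is not a Corosync API.

```python
def has_quorum(votes_seen: int, expected_votes: int, two_node: bool = False) -> bool:
    """Return True if a partition that sees `votes_seen` votes is quorate."""
    if two_node and expected_votes == 2:
        # two_node=1: a single surviving node stays quorate, so fencing
        # (not vote counting) has to decide which side may continue.
        return votes_seen >= 1
    # Standard majority rule: strictly more than half of the expected votes.
    return votes_seen >= expected_votes // 2 + 1


# Two nodes without the two_node flag: a lone node loses quorum.
print(has_quorum(1, 2))
# Two nodes with two_node=1: each half of a partition remains quorate.
print(has_quorum(1, 2, two_node=True))
# Two nodes plus a qdevice vote (3 expected): node + qdevice wins, lone node loses.
print(has_quorum(2, 3), has_quorum(1, 3))
```

The second case shows why a fencing-first policy is needed in two-node mode: both sides of a partition consider themselves quorate, so only fencing can break the tie.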
SAP ECS (RISE) reference architecture
The exact details depend on the contractual SLAs.
DB layer
- Site A: HANA primary
- Site B: HANA secondary (replication target)
- HANA SR runs in performance-optimized style in many HA designs: one primary, one hot standby secondary continuously receiving logs (log shipping + log replay).

Cluster layer (Pacemaker + Corosync)
- Two nodes (one per site) form a Pacemaker cluster with Corosync membership and votequorum.
- Cluster manages:
  - Start/stop of HANA
  - Promotion logic
  - Placement of VIP
  - “Do not fail over if unsafe” rules
On SLES-for-SAP, SUSE provides dedicated SAP HANA SR resource agents and a detailed end-to-end guide.
On RHEL for SAP Solutions, Red Hat provides SAP HANA SR automation guidance and the SAP HANA-specific agents via its SAP Solutions repositories.
Fencing layer
Mandatory. If Pacemaker cannot guarantee only one side is active primary, it fences.
SUSE documents SBD/watchdog requirements and alternative STONITH methods (SBD deterministic delays, IPMI examples).
Cloud patterns (for context) often use provider fencing plus VIP via internal load balancers.
Quorum strategy
- Often two-node votequorum mode plus fencing-first behavior. SUSE shows two_node=1 configuration.
- Some RISE landscapes use an additional quorum mechanism (3rd site/qdevice) depending on strictness and infrastructure.
VIP / client access
VIP is controlled by Pacemaker, tied by constraints to the promoted primary. SUSE shows VIP resource configuration and constraints.
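The rule "the VIP follows the promoted primary" can be illustrated with a small sketch. In a real cluster this is expressed as a constraint between the VIP resource and the promoted HANA instance, not as application code; the function `vip_target` and the role strings below are hypothetical and exist only for illustration.

```python
from typing import Optional


def vip_target(roles: dict[str, str]) -> Optional[str]:
    """Return the node that should hold the VIP, or None if placement is unsafe."""
    promoted = [node for node, role in roles.items() if role == "promoted"]
    # Exactly one promoted instance is the only state in which the VIP may run.
    return promoted[0] if len(promoted) == 1 else None


print(vip_target({"node-a": "promoted", "node-b": "unpromoted"}))
print(vip_target({"node-a": "unpromoted", "node-b": "unpromoted"}))
```

Returning no target when zero or two nodes claim the promoted role mirrors the design intent: clients should never be routed to a node whose primary status is ambiguous.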
Application layer (ABAP app servers in both sites)
- App servers in both sites are active and typically use:
  - DB connect via VIP (or LB address)
  - Logon groups and message server for user distribution
- After a DB failover, app servers reconnect to the same endpoint, but reach the new primary.
Why SAP ECS uses this setup
- Fast RTO: automate takeover and VIP move.
- Data integrity: “fail over only if safe” and fence to prevent dual primary.
- Operational consistency: standardized patterns across customers, OS-vendor-supported tooling (SUSE best practices, Red Hat SAP Solutions guidance).
RHEL vs SLES differences
Common core: Pacemaker + Corosync + STONITH + VIP + HANA SR agents exist on both.
Key differences you will notice in real projects:
- SLES for SAP
  - Strong “SAPHanaSR” ecosystem and the newer SAPHanaSR-angi, recommended for new deployments on SLES-for-SAP 15 SP4 and later.
  - SBD/watchdog is very commonly used and heavily documented.
  - YaST tooling exists for cluster setup in SLES-for-SAP (convenience, not a requirement).
- RHEL for SAP Solutions
  - SAP HANA SR automation is delivered via Red Hat’s SAP Solutions content and packages, and Red Hat documents extra considerations (for example systemd-based SAP startup interactions).
HANA log shipping and log replay
How HANA log shipping and log replay work
In a two-site HANA System Replication setup (performance-optimized), replication is based on redo logs, not full data pages.
Step-by-step process:
- Transaction commit on primary
  - A user transaction changes data in memory.
  - The change is written to the HANA redo log buffer.
  - On commit, the redo log is flushed to the primary’s log volume.
- Log shipping
  - The primary continuously streams redo log entries to the secondary over the SR channel.
  - This happens at kernel level, not via SQL or file copy.
  - Shipping can be synchronous or asynchronous depending on configuration.
- Log replay on secondary
  - The secondary receives redo entries.
  - It writes them to its own log volume.
  - The redo entries are then replayed into the secondary’s memory structures.
  - As long as replay keeps up, the secondary is transactionally consistent with the primary.
The important distinction:
- Log shipped means the redo arrived.
- Log replayed means the redo has been applied and the in-memory database state matches the primary.
In async replication mode, there can be a small delay between primary commit and replay on the secondary. That delay defines your RPO exposure.
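The difference between "shipped" and "replayed" can be made concrete with a simplified model. The sketch below assumes redo positions are plain integers that only grow; the class `LogPositions` and its field names are hypothetical and do not correspond to HANA views or APIs.

```python
from dataclasses import dataclass


@dataclass
class LogPositions:
    committed: int  # last redo position committed on the primary
    shipped: int    # last redo position received by the secondary
    replayed: int   # last redo position applied on the secondary

    def unshipped(self) -> int:
        """Redo committed on the primary that has not yet reached the secondary."""
        return self.committed - self.shipped

    def replay_backlog(self) -> int:
        """Redo that has arrived on the secondary but is not yet applied."""
        return self.shipped - self.replayed


pos = LogPositions(committed=1000, shipped=990, replayed=950)
print(pos.unshipped(), pos.replay_backlog())
```

In this model, a secondary is fully caught up only when both values are zero.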
How Pacemaker “knows” everything has been replicated
Pacemaker does not inspect database blocks or compare data pages. It relies entirely on HANA System Replication state interfaces exposed by HANA and interpreted by the HANA SR resource agents (for example SAPHanaSR / SAPHanaSR-angi on SLES or the Red Hat equivalents).
The logic works like this:
- Resource agent queries HANA SR status
  - The agent regularly calls HANA internal interfaces.
  - It checks:
    - Role (primary or secondary)
    - Replication mode
    - Connection status
    - Replay status
    - Synchronization state
- HANA classifies replication safety
  HANA internally classifies the replication state into categories such as:
  - Active / connected
  - Synchronized / in sync
  - Catch-up
  - Error / disconnected
- Agent maps SR state to cluster attributes
  The resource agent translates the SR state into cluster attributes like:
  - “can_promote”
  - “sync_state”
  - “sr_connected”
- Promotion rules use those attributes
  Pacemaker’s promotion decision is constrained by:
  - SR state must be safe
  - The node must have quorum
  - Fencing must be guaranteed
If the secondary is not in a safe synchronization state, the resource agent reports that promotion is unsafe. Pacemaker then blocks automatic takeover. This implements the core safety rule: no automatic failover if out of sync.
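The mapping from SR status to a promotion decision can be sketched as below. The sketch assumes the return codes of HANA's `systemReplicationStatus.py` script (10 to 15, as listed in the dictionary) and the coarse SOK/SFAIL labels used by the SUSE agents; both should be verified against the HANA revision and agent version in use. The function names are hypothetical and this is not the agents' actual code.

```python
# Return codes assumed from systemReplicationStatus.py; verify for your HANA revision.
SR_STATUS = {
    10: "NONE",
    11: "ERROR",
    12: "UNKNOWN",
    13: "INITIALIZING",
    14: "SYNCING",
    15: "ACTIVE",
}


def sync_state(return_code: int) -> str:
    """Map an SR status return code to a coarse cluster attribute value."""
    # Only fully active replication counts as safe; anything else,
    # including an unrecognised code, is treated as a failure.
    return "SOK" if SR_STATUS.get(return_code) == "ACTIVE" else "SFAIL"


def promotion_allowed(return_code: int, has_quorum: bool, fencing_ok: bool) -> bool:
    """Promotion needs a safe SR state, quorum, and guaranteed fencing."""
    return sync_state(return_code) == "SOK" and has_quorum and fencing_ok


print(sync_state(15), sync_state(14))
print(promotion_allowed(15, has_quorum=True, fencing_ok=False))
```

Treating unknown codes as a failure is a deliberate choice in this sketch: when the state cannot be interpreted, the safe default is to block promotion.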
Important architectural detail
Pacemaker does not calculate LSN differences or compare timestamps itself. It trusts the HANA kernel’s SR state evaluation. That is why:
- If monitoring paths break,
- If HostAgent or SR interfaces are unreachable,
- Or if quorum/fencing is degraded,
the cluster may refuse promotion even if replication looks fine from a DBA perspective. This separation of responsibilities is intentional:
- HANA guarantees data correctness.
- Pacemaker guarantees single-primary enforcement and controlled promotion.
- The resource agent bridges both worlds.
What “safe” actually means in practice
For automatic promotion to be allowed, typically all of the following must be true:
- Secondary is connected to primary
- Log replay is running
- No backlog beyond allowed thresholds
- No replication errors
- Cluster quorum is intact
- Fencing is operational
If any of those checks fail, Pacemaker will refuse automatic promotion even if the primary crashes.
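A checklist like this is easiest to reason about when every failed check produces an explicit reason. The following sketch assumes a hypothetical `ClusterView` snapshot and a hypothetical `max_backlog_bytes` threshold; real agents derive these values from HANA and cluster state in their own way.

```python
from dataclasses import dataclass


@dataclass
class ClusterView:
    secondary_connected: bool
    replay_running: bool
    backlog_bytes: int
    replication_errors: int
    quorum_intact: bool
    fencing_operational: bool


def blocking_reasons(view: ClusterView, max_backlog_bytes: int = 0) -> list[str]:
    """Return the failed checks; an empty list means promotion may proceed."""
    checks = [
        (view.secondary_connected, "secondary not connected"),
        (view.replay_running, "log replay not running"),
        (view.backlog_bytes <= max_backlog_bytes, "backlog above threshold"),
        (view.replication_errors == 0, "replication errors present"),
        (view.quorum_intact, "quorum lost"),
        (view.fencing_operational, "fencing not operational"),
    ]
    return [reason for ok, reason in checks if not ok]


healthy = ClusterView(True, True, 0, 0, True, True)
degraded = ClusterView(True, True, 4096, 0, True, False)
print(blocking_reasons(healthy))
print(blocking_reasons(degraded))
```

Reporting reasons instead of a single boolean makes it clear to operators why a takeover was refused.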
Failover Scenarios
Scenario-by-scenario – Overview

# | Failure scenario | Typical detection | Pacemaker action | Auto failover possible? | Main risk / note |
1 | HW failure primary, cannot reboot | Node disappears from Corosync | Fence (if needed), promote secondary, move VIP | Yes if secondary is “safe” | If SR not safe, promotion blocked |
2 | HW failure primary, reboot possible | Node loss then returns | Usually treat as failure: fence/ban, failover, keep old as secondary after repair | Yes if safe | Avoid flapping causing dual primary |
3 | OS crash primary, reboot fails | Node loss | Same as #1 | Yes if safe | Same as #1 |
4 | OS crash primary, reboot possible | Node loss then returns | Same as #2 | Yes if safe | Must prevent old primary auto-start outside cluster |
5 | Storage failure primary (disk unreadable) | HANA health/resource monitor fails, FS agent may detect | Stop/fence primary if unsafe, promote secondary if safe | Sometimes | If data corruption suspected, avoid takeover |
6 | Network failure: primary site unreachable (from HA site) | Link loss, Corosync partition | Fence side that lost quorum per design, then promote if safe | Sometimes | Split brain risk drives fencing decision |
7 | Network failure: HA site unreachable (from primary) | Same as #6 | Usually keep primary running, fence unreachable node if needed | No failover needed | Ensure secondary does not promote |
8 | OS shutdown triggered on primary | Clean stop detected (systemd/monitor) | Controlled stop, promote secondary if policy allows | Sometimes | Usually treated like planned switchover |
9 | HANA crash primary (crash dump) | HANA monitor fails | Restart locally if allowed; if repeated or blocked then failover | Sometimes | Depends on restart policy and SR state |
10 | HANA restart triggered (SIGTERM/SIGKILL) | Instance stops | Restart resource; if not stable then failover | Sometimes | Avoid promoting if SR not safe |
11 | HANA on HA site not responding | Secondary monitor fails | Keep primary running, attempt recovery of secondary | No | Availability ok, redundancy degraded |
12 | SR out of sync (log replay behind) | SR status via agents/hooks | Block takeover, alert, optionally “freeze” | No (by design) | Enforces the “no auto failover if out of sync” rule |
13 | Split brain at HANA level (both primary) | SR state inconsistent + cluster policy | Immediate fencing of one side, then re-establish SR | No (automatic promotion is unsafe) | Data divergence risk |
14 | Pacemaker/fencing/quorum failure while HANA healthy | Cluster health checks fail | If cluster cannot guarantee safety, it may stop resources or refuse actions | No | “Cluster blind” scenario; protect data |
15 | Time sync failure | NTP/chrony drift alarms (often external) | Usually no immediate takeover; may stop cluster actions if policy | No (generally) | TLS/Kerberos, HANA, and cluster timing can break |
16 | SAP HostAgent failure | HostAgent monitor fails | Usually alert only; cluster may still run HANA | Yes/No depends | HostAgent impacts ops tooling, not always DB |
Data Loss Risk Overview
Legend:
- 🟢 No data loss risk
- 🟡 Small RPO possible (async lag dependent)
- 🔴 High / structural data loss risk
- 🔴🔴 Divergence risk (worse than partial loss)
# | Failure Scenario | Auto Failover Typical | Data Loss Risk (async) | Data Loss Risk (sync) | Explanation (RPO View) |
1 | HW failure primary, no reboot | Yes (if safe) | 🟡 | 🟢 | Small RPO possible in async mode (redo not yet shipped/replayed) |
2 | HW failure primary, reboot possible | Yes (if safe) | 🟡 | 🟢 | Same as #1 |
3 | OS crash primary, reboot fails | Yes (if safe) | 🟡 | 🟢 | Same mechanics as hardware crash |
4 | OS crash primary, reboot possible | Yes (if safe) | 🟡 | 🟢 | Same as #3 |
5 | Storage failure primary | Sometimes | 🔴 | 🔴 | Possible partial loss + potential corruption risk |
6 | Network partition – primary unreachable | Sometimes | 🔴 | 🔴 | High risk if fencing fails or takeover forced |
7 | Network partition – HA unreachable | No failover | 🟢 | 🟢 | Primary continues; no takeover |
8 | OS shutdown primary | Sometimes | 🟢 / 🟡 | 🟢 / 🟡 | Safe if planned; small RPO if unplanned |
9 | HANA crash primary | Sometimes | 🟡 | 🟢 | Same as crash scenarios |
10 | HANA restart triggered | Sometimes | 🟡 | 🟢 | Small RPO possible if takeover before replay |
11 | Secondary not responding | No | 🟢 | 🟢 | No promotion; redundancy degraded only |
12 | SR out of sync | No (blocked) | 🔴 | 🔴 | Forced takeover causes guaranteed data loss |
13 | Split brain | No | 🔴🔴 | 🔴🔴 | Data divergence; one side must be discarded |
14 | Cluster / fencing / quorum failure | No | 🟢 | 🟢 | Cluster blocks unsafe actions |
15 | Time sync failure | No | 🟢 | 🟢 | No direct replication impact |
16 | SAP Host Agent failure | Depends | 🟢 | 🟢 | Operational impact only |
How replication mode and handling change the risk:
- sync or syncmem mode → Scenarios 1–4, 9, 10 typically move from 🟡 to 🟢 (RPO = 0)
- async mode → Small crash-window exposure always exists
- forced takeover while out of sync → Explicitly accepted data loss (scenario 12) 🔴
- fencing failure in partition → Divergence possible (scenario 6 → potentially 13) 🔴
1 Hardware failure primary site (instant crash, reboot not possible)
Detect
- Corosync membership loss (node gone)
- Resource monitors fail
Sync evaluation
- Secondary SR state checked. If it reports a safe replication state (typical “in sync” condition), takeover can proceed.
Pacemaker action
- Fence the dead node if required by policy. Depending on setup, a fence request that cannot be confirmed counts as failed fencing and blocks promotion.
- Promote secondary
- Move VIP to new primary
Auto failover
- Yes, if secondary is safe.
Risk of Data Loss
- Possible in async mode if redo logs were committed on the primary but not yet shipped/replayed.
- No loss expected in fully synchronized (sync/syncmem) mode at crash time.
- Exposure limited to the replication lag window (small RPO)
2 Hardware failure primary (crash, reboot possible)
Detect
- Same as #1, plus the node may come back later.
Pacemaker action
- Usually failover same as #1.
- When the old primary returns, it must not come up as primary outside cluster control.
- After repair, the old primary is reintroduced as secondary and SR is re-registered.
Auto failover
- Yes, if secondary is safe.
Risk of Data Loss
- Same technical exposure as #1.
- Reboot possibility does not change replication risk.
- Small RPO possible in async mode
3 OS crash primary (reboot fails)
Detect
- Node loss, same as #1.
Sync evaluation
- Secondary SR state checked.
Pacemaker action
- Same as #1.
Auto failover
- Yes, if secondary is safe.
Risk of Data Loss
- Same mechanics as hardware crash.
- Possible small RPO in async mode.
- No loss if fully synchronized at crash time.
4 OS crash primary (reboot possible)
Detect
- Node loss, then the node returns.
Pacemaker action
- Same as #2.
- Must prevent old primary auto-start outside cluster control.
Auto failover
- Yes, if secondary is safe.
Risk of Data Loss
- Same as #3.
- Limited to replication lag window if async mode is used.
5 Storage failure primary site (one or more disks not readable)
Detect
- HANA monitor reports failure (I/O errors, services stop)
- Optional filesystem agents detect outages
Sync evaluation
- Secondary might still be in sync, but corruption risk must be considered.
Pacemaker action
- Stop or fence primary if unsafe
- Promote secondary only if SR state is safe
- If corruption suspected or SR unsafe, block promotion
Auto failover
- Sometimes; avoid takeover if data corruption is suspected.
Risk of Data Loss
- Elevated risk.
- Possible partial data loss if redo not shipped.
- Additional risk of logical corruption prior to crash.
- In worst cases, restore from backup may be required.
6 Network failure, Primary Site not reachable (from HA site)
This is the classic partition case. In a standard two-node cluster with two_node=1, neither side technically "loses quorum" in the traditional sense; rather, the tie-breaking is handled by fencing. The side that survives the fencing race is the one that promotes.
Detect
- Link loss, Corosync partition.
Sync evaluation
- HA side must not assume takeover without fencing certainty.
Pacemaker action
- Fence the other side before promotion.
- Fencing-first behavior critical.
Auto failover
- Sometimes, only if fencing succeeds and SR state allows.
Risk of Data Loss
- High risk if fencing fails or is bypassed.
- Potential partial loss if takeover occurs without last logs.
- Severe risk of data divergence if split brain develops.
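The fencing race in this scenario can be modelled as "whoever fires first survives". The sketch below assumes each node waits a configured delay before issuing its fence request, which is how fencing delay settings are commonly used to make the outcome deterministic. The function `fencing_race_winner` and the node names are hypothetical.

```python
def fencing_race_winner(delays: dict[str, float]) -> str:
    """Return the node that fires its fence request first (smallest delay)."""
    ordered = sorted(delays.items(), key=lambda item: item[1])
    if len(ordered) > 1 and ordered[0][1] == ordered[1][1]:
        # Equal delays mean both requests can fire together: no safe winner.
        raise ValueError("equal fencing delays: both nodes could fence each other")
    return ordered[0][0]


# Give the node that hosts the primary no delay so it survives a partition.
print(fencing_race_winner({"site-a": 0.0, "site-b": 15.0}))
```

Rejecting equal delays reflects the practical concern that two nodes fencing each other at the same moment can take the whole cluster down.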
7 Network failure, HA Site not reachable (from primary)
Detect
- Same partition, seen from primary side.
Pacemaker action
- Primary keeps running.
- Prevent secondary from promoting.
Auto failover
- No failover needed; primary continues to serve.
Risk of Data Loss
- No immediate data loss risk.
- Redundancy degraded only.
8 OS shutdown triggered on primary site
Typically, a clean OS shutdown without a pre-orchestrated resource move or standby command might be seen by the cluster as a failure rather than a "clean" move unless specifically configured.
Detect
- Clean stop observed by monitors.
Sync evaluation
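- Treated as a controlled switchover only if the node was put into standby or the resources were moved beforehand; otherwise SR state is evaluated as for a failure.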
Pacemaker action
- Move VIP and promote secondary before shutdown if planned.
Auto failover
- Sometimes, if policy allows.
Risk of Data Loss
- No loss if executed as planned switchover with verified sync.
- Possible small RPO if treated as failure while not fully synchronized.
9 HANA crash on primary (crash dump)
Detect
- SAPHana* monitor detects instance down.
Sync evaluation
- Secondary SR status checked.
Pacemaker action
- Attempt local restart first.
- If restart fails, takeover if safe.
Auto failover
- Sometimes; depends on restart policy and SR state.
Risk of Data Loss
- Same exposure as crash scenarios.
- Small RPO possible in async mode.
- No loss if replay fully caught up.
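The "restart locally first, then take over if safe" policy can be sketched as a small decision function. The parameter `restart_limit` is a hypothetical stand-in for the configured restart policy (Pacemaker's migration-threshold plays a comparable role), and the returned action names are illustrative only.

```python
def next_action(fail_count: int, restart_limit: int, sr_safe: bool) -> str:
    """Decide what to do after the HANA monitor reports a failure on the primary."""
    if fail_count < restart_limit:
        # Still within the restart budget: try to recover in place.
        return "restart-local"
    # Restart budget exhausted: take over only if replication is safe.
    return "failover" if sr_safe else "block-and-alert"


print(next_action(fail_count=1, restart_limit=3, sr_safe=True))
print(next_action(fail_count=3, restart_limit=3, sr_safe=True))
print(next_action(fail_count=3, restart_limit=3, sr_safe=False))
```

The third case is the important one: exhausting local restarts does not by itself justify a takeover when the secondary is not in a safe state.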
10 HANA restart triggered on primary (SIGTERM, SIGKILL)
Operationally similar to #9, except it might be “cleaner” (SIGTERM) or abrupt (SIGKILL). Pacemaker will try restart; if not stable, it may fail over if safe.
Detect
- Instance stops.
Pacemaker action
- Restart resource.
- If unstable, failover if safe.
Auto failover
- Sometimes; only if the restart does not stabilize and SR is safe.
Risk of Data Loss
- Possible if takeover occurs before replay completion.
- Clean SIGTERM with synchronized SR typically safe.
11 HANA on HA site not responding
Detect
- Secondary monitor fails.
Sync evaluation
- No takeover because primary is up.
Pacemaker action
- Keep primary running.
- Attempt recovery of secondary.
Auto failover
- No.
Risk of Data Loss
- No data loss.
- Availability intact.
- Redundancy temporarily lost.
12 Primary/HA out of sync (log replay not working or falling behind)
Detect
- SR status indicates not healthy/synchronized.
Pacemaker action
- Block automatic takeover.
- Alert operators.
Auto failover
- No (by design).
Risk of Data Loss
- Guaranteed data loss if manual forced takeover is executed.
- Data loss equals replication lag at time of takeover.
13 Split brain at HANA level (both think they are primary)
Detect
- Inconsistent SR state.
- Impossible role combination.
Pacemaker action
- Immediate fencing of one side.
- Manual recovery and re-establishment of SR.
Auto failover
- No; automatic promotion is unsafe.
Risk of Data Loss
- Very high.
- Data divergence possible.
- One side’s committed transactions must be discarded.
14 Pacemaker/Fencing device/Quorum failure while HANA is healthy
Detect
- Cluster services degraded.
- STONITH errors.
- Quorum subsystem issues.
Pacemaker action
- Refuse dangerous actions.
- Keep current primary running if possible.
Auto failover
- No (in most sane policies).
Risk of Data Loss
- No immediate risk.
- Risk only arises if administrators force unsafe manual takeover.
15 Time synchronization failure
Detect
- NTP/chrony drift alarms (often external).
Pacemaker action
- Usually no immediate takeover.
- May block cluster actions if severe.
Auto failover
- No (generally).
Risk of Data Loss
- No direct replication loss.
- Indirect operational instability possible.
16 SAP HostAgent failure
Detect
- Host Agent monitor fails.
Pacemaker action
- Usually alert only.
- HANA continues running.
Auto failover
- Depends on setup; usually alert only.
Risk of Data Loss
- No direct data loss.
- Impacts operational tooling rather than replication.
Summary

Failover sequence (happy path, safe SR)
- Primary fails
- Cluster detects node/resource failure
- Cluster fences old primary (or confirms dead)
- Cluster promotes secondary to primary
- VIP moves to new primary
- ABAP app servers reconnect to same endpoint (VIP)
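The essential property of this sequence is its ordering: no step may run unless the previous one succeeded. A minimal sketch of that property, with hypothetical step names and stand-in callables in place of real cluster actions:

```python
from typing import Callable

Step = tuple[str, Callable[[], bool]]


def run_failover(steps: list[Step]) -> list[str]:
    """Run steps in order; stop at the first failure and return what completed."""
    done: list[str] = []
    for name, action in steps:
        if not action():
            break  # never continue past a failed step (e.g. fencing not confirmed)
        done.append(name)
    return done


steps = [
    ("detect", lambda: True),
    ("fence", lambda: False),  # fencing could not be confirmed
    ("promote", lambda: True),
    ("move_vip", lambda: True),
]
print(run_failover(steps))
```

In this example the promotion and the VIP move are never attempted, because fencing did not succeed.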
Where automatic failover is possible vs not
Automatic failover typically possible (if SR is safe and fencing works)
- 1, 2, 3, 4
- 16 (depends on setup; a HostAgent failure alone usually only raises an alert)
- 6 (only if fencing succeeds and policy allows)
- 8, 9, 10 (depends on restart policy and SR safety)
Automatic failover typically NOT possible (blocked by safety)
- 12 out of sync (blocked by design)
- 13 split brain (must fence and manually recover)
- 14 cluster safety infrastructure broken (quorum/fencing)
- Often 5 if corruption risk exists
Usually no failover needed (service continues)
- 7 standby unreachable (primary still serves)
- 11 standby HANA not responding (redundancy degraded)
Manual failover required: when and steps
When manual failover is required (common cases)
- 12 out-of-sync SR but primary is dead or unstable and business needs recovery
- 5 storage failure where integrity is uncertain
- 13 split brain recovery
- 14 fencing/quorum broken but you must move service
Manual failover steps (safe generic runbook)
- Stop business access to DB endpoint (freeze app tier if possible).
- Confirm primary is truly down or isolated.
- Fence old primary (or otherwise guarantee it cannot run HANA primary anymore).
- On the secondary, validate HANA SR status and decide:
  - If acceptable, execute HANA takeover (accepting possible data loss if not in sync).
- In Pacemaker, ensure the promotable resource is promoted on the intended node and VIP follows.
- Re-register the old site as secondary after it is repaired, then resync.
(Exact commands differ by deployment tooling, but the safety sequence above is the key.)
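The two gates in this runbook, guaranteed isolation of the old primary and an explicit decision about data loss, can be sketched as a guard function. The function and parameter names are hypothetical; the point is only that a takeover on an out-of-sync secondary requires a deliberate acknowledgement.

```python
def manual_takeover_allowed(
    old_primary_fenced: bool,
    sr_in_sync: bool,
    accept_data_loss: bool = False,
) -> bool:
    """Gate a manual takeover on isolation first, then on sync state or explicit consent."""
    if not old_primary_fenced:
        # Without guaranteed isolation a takeover risks two primaries.
        return False
    return sr_in_sync or accept_data_loss


print(manual_takeover_allowed(old_primary_fenced=True, sr_in_sync=False))
print(manual_takeover_allowed(old_primary_fenced=True, sr_in_sync=False, accept_data_loss=True))
```

Accepting data loss never overrides the fencing requirement in this sketch, which matches the order of the steps above.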
Scenarios where neither auto nor manual failover is possible
These are rare, but real:
- Both sites down, or secondary storage also broken
- Replication is broken and the last consistent copy is unknown
- Severe split brain with uncertain data consistency and no trustworthy primary
In these cases, failover is not the solution.
When restore from backup is required
Most commonly:
- Primary storage failure plus suspected corruption and secondary is not a trustworthy consistent target (5 + unsafe SR)
- Split brain with data divergence where you cannot guarantee correctness (13)
- Prolonged out-of-sync where logs are missing and secondary cannot be promoted without unacceptable loss (12 in worst-case)
When Pacemaker cannot reliably detect “in sync”
Pacemaker can only act on what the SR interfaces and monitoring provide. It can lose certainty when:
- Monitoring path is broken (cluster can’t query SR state)
- Time sync problems cause false timeouts (15)
- HostAgent or SAP control interfaces used by the agent are unavailable (16, depending on setup)
- Quorum/STONITH is degraded (14), so even if SR looks good, the cluster cannot guarantee single-primary enforcement
What then?
- Safe designs default to blocking promotion until:
  - fencing certainty is restored, and
  - SR health can be validated again
Documentation
Further links and documentation are listed below:
SAP Notes
SAP Internal