High Availability with Pacemaker

christoph_weyd · 2026 Feb 12

Introduction

Pacemaker’s role is to automate safe failover. It continuously monitors node health, HANA process health, and the replication state between sites. If the primary becomes unavailable, the cluster can promote the standby to primary and move the client access endpoint (typically a virtual IP or load balancer address) so applications reconnect without configuration changes. To avoid the most dangerous failure mode, dual-primary or “split brain,” the design depends on reliable fencing (STONITH) and a quorum strategy that ensures the cluster only makes promotion decisions when it can guarantee that the old primary is truly stopped.

This document describes high availability for SAP HANA using Pacemaker on Linux, focusing on a two‑site, active/passive database design with system replication and automatic failover. It explains how Pacemaker, Corosync, fencing mechanisms, and HANA system replication work together to provide fast recovery, protect against split‑brain, and ensure that failover only occurs when data is safely in sync.

Content Overview

Introduction
Content Overview
Key Terms
SAP ECS (RISE) reference architecture
HANA log shipping and log replay
Failover Scenarios
Summary
Documentation

Key Terms

Fencing (STONITH)

Fencing means forcibly removing a node from service so it cannot access shared resources or continue serving clients (classic “Shoot The Other Node In The Head”). In HANA SR clusters, fencing is primarily about preventing two primaries (data corruption risk).

Common fencing methods:

SBD with watchdog (very common on SLES for SAP): the node self-fences through a watchdog timer; SBD runs either disk-based (on a shared block device) or diskless. SUSE explicitly stresses watchdog usage for SBD reliability.
Power-management and platform agents: IPMI/iLO, cloud fence agents, and similar. SUSE provides IPMI examples.

Promotion

Pacemaker models “who is primary” as a promotable resource (historically “master/slave”). Promotion is the action: secondary is promoted to primary (takeover).

Failover

Clients (ABAP, tools, interfaces) connect to a VIP (or a cloud load balancer fronting an IP). Pacemaker moves the VIP to the node that currently hosts the primary HANA, so applications do not need to change connection endpoints. SUSE’s guide shows adding the VIP as an IPaddr2 resource and notes extra settings in public cloud.

Quorum Design

Quorum is “can this cluster make decisions safely.” Two-node clusters are special:

Many setups use two_node=1 and a fencing-first policy: you can run with 2 nodes, but loss of communication must result in fencing to avoid split brain. SUSE shows the corosync votequorum two-node configuration.
Some environments add a qdevice / “majority maker” in a third location (common for strict quorum semantics), especially for stretched scenarios.
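
To make the difference between the two approaches concrete, here is a minimal Python sketch of the vote arithmetic. It is a simplified model, not Corosync code: the function names are made up for illustration, and the only behaviour taken from votequorum is that two_node mode lowers the quorum threshold to 1 in a two-vote cluster.

```python
def quorum_threshold(expected_votes: int, two_node: bool = False) -> int:
    """Votes a partition needs in order to be quorate (simplified model)."""
    if two_node and expected_votes == 2:
        # two_node mode: the threshold is lowered to 1, so a partition cannot
        # be resolved by votes alone and fencing has to break the tie.
        return 1
    # Plain majority rule.
    return expected_votes // 2 + 1


def is_quorate(votes_seen: int, expected_votes: int, two_node: bool = False) -> bool:
    """True if a partition that sees `votes_seen` votes may make decisions."""
    return votes_seen >= quorum_threshold(expected_votes, two_node)


# Two-node cluster after a partition: each side sees only its own vote.
print(is_quorate(1, 2, two_node=True))   # both sides remain quorate, fencing decides
print(is_quorate(1, 2, two_node=False))  # plain majority: neither side is quorate
# With a third vote (for example a qdevice), the side that still reaches it keeps majority.
print(is_quorate(2, 3))
```

The sketch shows why two_node mode is only safe together with fencing: both halves of a partition stay quorate, so something other than the vote count has to decide which side survives.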

SAP ECS (RISE) reference architecture

The exact details depend on the contractual SLAs.

DB layer

Site A: HANA primary
Site B: HANA secondary (replication target)
HANA SR runs in performance-optimized style in many HA designs: one primary, one hot standby secondary continuously receiving logs (log shipping + log replay).

Cluster layer (Pacemaker + Corosync)

Two nodes (one per site) form a Pacemaker cluster with Corosync membership and votequorum.
Cluster manages:

Start/stop of HANA
Promotion logic
Placement of VIP
“Do not fail over if unsafe” rules

On SLES-for-SAP, SUSE provides dedicated SAP HANA SR resource agents and a detailed end-to-end guide.
On RHEL for SAP Solutions, Red Hat provides SAP HANA SR automation guidance and the SAP HANA-specific agents via its SAP Solutions repositories.

Fencing layer

Mandatory. If Pacemaker cannot guarantee only one side is active primary, it fences.
SUSE documents SBD/watchdog requirements and alternative STONITH methods (SBD deterministic delays, IPMI examples).
Cloud patterns (for context) often use provider fencing plus VIP via internal load balancers.

Quorum strategy

Often two-node votequorum mode plus fencing-first behavior. SUSE shows two_node=1 configuration.
Some RISE landscapes use an additional quorum mechanism (3rd site/qdevice) depending on strictness and infrastructure.

VIP / client access

VIP is controlled by Pacemaker, tied by constraints to the promoted primary. SUSE shows VIP resource configuration and constraints.
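
The "VIP follows the promoted primary" rule can be pictured with a few lines of Python. This is only a sketch of the placement idea, not how Pacemaker evaluates constraints internally; the function name, the role strings, and the choice to place the VIP nowhere when the role picture is ambiguous are assumptions made for illustration.

```python
from typing import Dict, Optional


def place_vip(roles: Dict[str, str]) -> Optional[str]:
    """Return the node that should hold the VIP, or None if placement is unsafe."""
    promoted = [node for node, role in roles.items() if role == "promoted"]
    # Exactly one promoted instance: the VIP is colocated with it.
    if len(promoted) == 1:
        return promoted[0]
    # Zero or several promoted instances: this sketch places the VIP nowhere.
    return None


before = {"siteA": "promoted", "siteB": "unpromoted"}
after = {"siteA": "stopped", "siteB": "promoted"}
print(place_vip(before), place_vip(after))
```

Clients keep using the same address; only the node behind it changes.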

Application layer (ABAP app servers in both sites)

App servers in both sites are active and typically use:

DB connect via VIP (or LB address)
Logon groups and message server for user distribution

After a DB failover, app servers reconnect to the same endpoint, but reach the new primary.

Why SAP ECS uses this setup

Fast RTO: automate takeover and VIP move.
Data integrity: “fail over only if safe” and fence to prevent dual primary.
Operational consistency: standardized patterns across customers, OS-vendor-supported tooling (SUSE best practices, Red Hat SAP Solutions guidance).

RHEL vs SLES differences

Common core: Pacemaker + Corosync + STONITH + VIP + HANA SR agents exist on both.

Key differences you will notice in real projects:

SLES for SAP

Strong “SAPHanaSR” ecosystem and the newer SAPHanaSR-angi, recommended for new deployments on SLES-for-SAP 15 SP4 and later.
SBD/watchdog is very commonly used and heavily documented.
YaST tooling exists for cluster setup in SLES-for-SAP (convenience, not a requirement).

RHEL for SAP Solutions

SAP HANA SR automation is delivered via Red Hat’s SAP Solutions content and packages, and Red Hat documents extra considerations (for example systemd-based SAP startup interactions).

HANA log shipping and log replay

How HANA log shipping and log replay work

In a two-site HANA System Replication setup (performance-optimized), replication is based on redo logs, not full data pages.

Step-by-step process:

Transaction commit on primary
- A user transaction changes data in memory.
- The change is written to the HANA redo log buffer.
- On commit, the redo log is flushed to the primary’s log volume.
Log shipping
- The primary continuously streams redo log entries to the secondary over the SR channel.
- This happens inside the HANA database engine (the HANA kernel, not the OS kernel), not via SQL or file copy.
- Shipping can be synchronous or asynchronous depending on configuration.
Log replay on secondary
- The secondary receives redo entries.
- It writes them to its own log volume.
- The redo entries are then replayed into the secondary’s memory structures.
- As long as replay keeps up, the secondary is transactionally consistent with the primary.

The important distinction:

Log shipped means the redo arrived.
Log replayed means the redo has been applied and the in-memory database state matches the primary.

In a performance-optimized setup running in async replication mode, there can be a small delay between the commit on the primary and shipping and replay on the secondary. That delay defines your RPO exposure.
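
The difference between "shipped" and "replayed" is easier to see with three counters. The following Python sketch is a toy model with invented names, not HANA internals: it tracks the last committed, shipped, and replayed redo positions, and reports how much redo the secondary has never received and how much it has received but not yet applied.

```python
from dataclasses import dataclass


@dataclass
class ReplicationState:
    committed: int = 0  # last redo position committed on the primary
    shipped: int = 0    # last redo position received by the secondary
    replayed: int = 0   # last redo position applied on the secondary

    def commit(self, records: int = 1) -> None:
        """The primary commits further redo records."""
        self.committed += records

    def ship(self) -> None:
        """Everything committed so far arrives on the secondary."""
        self.shipped = self.committed

    def replay(self) -> None:
        """The secondary applies everything it has received."""
        self.replayed = self.shipped

    def unshipped(self) -> int:
        """Redo the secondary has never received."""
        return self.committed - self.shipped

    def replay_backlog(self) -> int:
        """Redo that has arrived on the secondary but is not yet applied."""
        return self.shipped - self.replayed


state = ReplicationState()
state.commit(5)  # five redo records committed on the primary
state.ship()     # all five arrive on the secondary
state.commit(2)  # two more commits before the next shipment (async lag)
state.replay()   # the secondary applies what it has received
print(state.unshipped(), state.replay_backlog())
```

In this model, a primary crash at the end of the example would leave two committed records that never reached the secondary, which is the async exposure described above.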

How Pacemaker “knows” everything has been replicated

Pacemaker does not inspect database blocks or compare data pages. It relies entirely on HANA System Replication state interfaces exposed by HANA and interpreted by the HANA SR resource agents (for example SAPHanaSR / SAPHanaSR-angi on SLES or the Red Hat equivalents).

The logic works like this:

Resource agent queries HANA SR status
- The agent regularly calls HANA internal interfaces.
- It checks:
  - Role (primary or secondary)
  - Replication mode
  - Connection status
  - Replay status
  - Synchronization state

HANA classifies replication safety
- HANA internally classifies the replication state into categories such as:
  - Active / connected
  - Synchronized / in sync
  - Catch-up
  - Error / disconnected

Agent maps SR state to cluster attributes
- The resource agent translates the SR state into cluster attributes. The names below are illustrative; the real attribute names depend on the agent and the SID:
  - “can_promote”
  - “sync_state”
  - “sr_connected”

Promotion rules use those attributes
- Pacemaker’s promotion decision is constrained by:
  - SR state must be safe
  - The node must have quorum
  - Fencing must be guaranteed

If the secondary is not in a safe synchronization state, the resource agent reports that promotion is unsafe. Pacemaker then blocks automatic takeover. This implements the core safety rule: no automatic failover if out of sync.
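
As a rough illustration of the mapping step, the Python sketch below turns a role and an SR category into a small attribute set plus a promotion score. Everything in it is an assumption for illustration: the attribute names are the illustrative ones from the list above, the category strings and score values are hypothetical, and treating only "in_sync" as safe is a deliberately conservative choice. The only Pacemaker fact used is that a score of -1000000 stands for -INFINITY.

```python
MINUS_INFINITY = -1_000_000  # Pacemaker represents -INFINITY as a score of -1000000

SAFE_STATES = {"in_sync"}  # assumption: only this category allows takeover
CONNECTED_STATES = {"active", "in_sync", "catch_up"}


def map_sr_state(role: str, sr_state: str) -> dict:
    """Translate an SR role/state pair into illustrative cluster attributes."""
    if role == "primary":
        score = 150             # hypothetical: the running primary keeps the highest score
    elif sr_state in SAFE_STATES:
        score = 100             # hypothetical: an in-sync secondary is a valid target
    else:
        score = MINUS_INFINITY  # never promote this node automatically
    return {
        "sr_connected": sr_state in CONNECTED_STATES,
        "sync_state": sr_state,
        "can_promote": score > 0,
        "promotion_score": score,
    }


print(map_sr_state("secondary", "in_sync"))
print(map_sr_state("secondary", "catch_up"))
```

The design point is that the cluster never interprets replication details itself; it only sees a score, and a secondary that is not in a safe state simply never becomes a promotion candidate.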

Important architectural detail

Pacemaker does not calculate LSN differences or compare timestamps itself. It trusts the HANA kernel’s SR state evaluation. That is why the cluster may refuse promotion, even if replication looks fine from a DBA perspective, when any of the following applies:

Monitoring paths break
HostAgent or SR interfaces are unreachable
Quorum or fencing is degraded

This separation of responsibilities is intentional:

HANA guarantees data correctness.
Pacemaker guarantees single-primary enforcement and controlled promotion.
The resource agent bridges both worlds.

What “safe” actually means in practice

For automatic promotion to be allowed, typically all of the following must be true:

Secondary is connected to primary
Log replay is running
No backlog beyond allowed thresholds
No replication errors
Cluster quorum is intact
Fencing is operational

If any of those checks fail, Pacemaker will refuse automatic promotion even if the primary crashes.
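
The checklist above behaves like a logical AND with one extra rule: anything that cannot be confirmed counts as failed. A minimal Python sketch, with check names invented to mirror the list (they are not real cluster attributes):

```python
from typing import Dict, List

# Hypothetical check names mirroring the list above.
REQUIRED_CHECKS = [
    "secondary_connected",
    "log_replay_running",
    "backlog_within_threshold",
    "no_replication_errors",
    "quorum_intact",
    "fencing_operational",
]


def failed_checks(status: Dict[str, bool]) -> List[str]:
    """Return the checks that block promotion; a missing value counts as failed."""
    return [name for name in REQUIRED_CHECKS if not status.get(name, False)]


def promotion_allowed(status: Dict[str, bool]) -> bool:
    """Promotion is allowed only if every required check is confirmed."""
    return not failed_checks(status)


status = {name: True for name in REQUIRED_CHECKS}
status["fencing_operational"] = False
print(promotion_allowed(status), failed_checks(status))
```

Returning the list of failed checks rather than a bare boolean is a convenience for operators: it says why takeover was refused.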

Failover Scenarios

Scenario-by-scenario – Overview

#	Failure scenario	Typical detection	Pacemaker action	Auto failover possible?	Main risk / note
1	HW failure primary, cannot reboot	Node disappears from Corosync	Fence (if needed), promote secondary, move VIP	Yes if secondary is “safe”	If SR not safe, promotion blocked
2	HW failure primary, reboot possible	Node loss then returns	Usually treat as failure: fence/ban, failover, keep old as secondary after repair	Yes if safe	Avoid flapping causing dual primary
3	OS crash primary, reboot fails	Node loss	Same as #1	Yes if safe	Same as #1
4	OS crash primary, reboot possible	Node loss then returns	Same as #2	Yes if safe	Must prevent old primary auto-start outside cluster
5	Storage failure primary (disk unreadable)	HANA health/resource monitor fails, FS agent may detect	Stop/fence primary if unsafe, promote secondary if safe	Sometimes	If data corruption suspected, avoid takeover
6	Network failure: primary site unreachable (from HA site)	Link loss, Corosync partition	Fence the peer first (with two_node=1 the fencing race decides the survivor), then promote if safe	Sometimes	Split brain risk drives fencing decision
7	Network failure: HA site unreachable (from primary)	Same as #6	Usually keep primary running, fence unreachable node if needed	No failover needed	Ensure secondary does not promote
8	OS shutdown triggered on primary	Clean stop detected (systemd/monitor)	Controlled stop, promote secondary if policy allows	Sometimes	Usually treated like planned switchover
9	HANA crash primary (crash dump)	HANA monitor fails	Restart locally if allowed; if repeated or blocked then failover	Sometimes	Depends on restart policy and SR state
10	HANA restart triggered (SIGTERM/SIGKILL)	Instance stops	Restart resource; if not stable then failover	Sometimes	Avoid promoting if SR not safe
11	HANA on HA site not responding	Secondary monitor fails	Keep primary running, attempt recovery of secondary	No	Availability ok, redundancy degraded
12	SR out of sync (log replay behind)	SR status via agents/hooks	Block takeover, alert, optionally “freeze”	No (by design)	Implements the “no auto failover if out of sync” rule
13	Split brain at HANA level (both primary)	SR state inconsistent + cluster policy	Immediate fencing of one side, then re-establish SR	No (automatic promotion is unsafe)	Data divergence risk
14	Pacemaker/fencing/quorum failure while HANA healthy	Cluster health checks fail	If cluster cannot guarantee safety, it may stop resources or refuse actions	No	“Cluster blind” scenario; protect data
15	Time sync failure	NTP/chrony drift alarms (often external)	Usually no immediate takeover; may block cluster actions if policy requires it	No (generally)	TLS/Kerberos, HANA, and cluster timing can break
16	SAP HostAgent failure	HostAgent monitor fails	Usually alert only; cluster may still run HANA	Depends on design	HostAgent impacts ops tooling, not always DB

Data Loss Risk Overview

Legend:

🟢 No data loss risk
🟡 Small RPO possible (async lag dependent)
🔴 High / structural data loss risk
🔴🔴 Divergence risk (worse than partial loss)

#	Failure Scenario	Auto Failover Typical	Data Loss Risk (async)	Data Loss Risk (sync)	Explanation (RPO View)
1	HW failure primary, no reboot	Yes (if safe)	🟡	🟢	Small RPO possible in async mode (redo not yet shipped/replayed)
2	HW failure primary, reboot possible	Yes (if safe)	🟡	🟢	Same as #1
3	OS crash primary, reboot fails	Yes (if safe)	🟡	🟢	Same mechanics as hardware crash
4	OS crash primary, reboot possible	Yes (if safe)	🟡	🟢	Same as #3
5	Storage failure primary	Sometimes	🔴	🔴	Possible partial loss + potential corruption risk
6	Network partition – primary unreachable	Sometimes	🔴	🔴	High risk if fencing fails or takeover forced
7	Network partition – HA unreachable	No failover	🟢	🟢	Primary continues; no takeover
8	OS shutdown primary	Sometimes	🟢 / 🟡	🟢 / 🟡	Safe if planned; small RPO if unplanned
9	HANA crash primary	Sometimes	🟡	🟢	Same as crash scenarios
10	HANA restart triggered	Sometimes	🟡	🟢	Small RPO possible if takeover before replay
11	Secondary not responding	No	🟢	🟢	No promotion; redundancy degraded only
12	SR out of sync	No (blocked)	🔴	🔴	Forced takeover causes guaranteed data loss
13	Split brain	No	🔴🔴	🔴🔴	Data divergence; one side must be discarded
14	Cluster / fencing / quorum failure	No	🟢	🟢	Cluster blocks unsafe actions
15	Time sync failure	No	🟢	🟢	No direct replication impact
16	SAP Host Agent failure	Depends	🟢	🟢	Operational impact only

If running:

sync or syncmem mode
→ Scenarios 1–4, 9, 10 typically move from 🟡 to 🟢 (RPO = 0), provided the secondary was in sync when the failure occurred
async mode
→ Small crash-window exposure always exists
forced takeover while out of sync
→ Explicitly accepted data loss (scenario 12) 🔴
fencing failure in partition
→ Divergence possible (scenario 6 → potentially 13) 🔴🔴

1 Hardware failure primary site (instant crash, reboot not possible)

Detect

Corosync membership loss (node gone)
Resource monitors fail

Sync evaluation

Secondary SR state checked. If it reports a safe replication state (typical “in sync” condition), takeover can proceed.

Pacemaker action

Fence the dead node. Depending on setup, a fencing attempt that cannot be confirmed (“can’t fence”) counts as failed fencing and blocks promotion.
Promote secondary
Move VIP to new primary

Auto failover

Yes, if secondary is safe.

Risk of Data Loss

Possible in async replication mode if transactions were committed on the primary but their redo was not yet shipped/replayed.
No loss expected in sync/syncmem mode if the secondary was fully synchronized at crash time.
Exposure limited to the replication lag window (small RPO).

2 Hardware failure primary (crash, reboot possible)

Detect

Same as #1, plus the node may come back later.

Pacemaker action

Usually failover same as #1.
When the old primary returns, it must not come up as primary outside cluster control.
After repair, the old primary is reintroduced as secondary and SR is re-registered.

Auto failover

Yes, if safe.

Risk of Data Loss

Same technical exposure as #1.
Reboot possibility does not change replication risk.
Small RPO possible in async mode
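
What should happen when the old primary comes back can be written as a small decision function. This is a sketch only: the function and its return values are hypothetical, and the `automated_register` flag is loosely modelled on the idea of letting the cluster re-register a former primary automatically rather than waiting for an administrator.

```python
def rejoin_action(local_last_role: str, peer_role: str, automated_register: bool) -> str:
    """Decide what a returning node may do, based on the peer's current role."""
    if peer_role == "primary":
        if local_last_role == "primary":
            # A takeover happened while this node was away: it must never
            # come back as primary.
            if automated_register:
                return "register_as_secondary"
            return "stay_stopped_until_manual_register"
        return "start_as_secondary"
    # The peer's role is not confirmed: leave the decision to the cluster.
    return "wait_for_cluster_decision"


print(rejoin_action("primary", "primary", automated_register=False))
print(rejoin_action("primary", "primary", automated_register=True))
```

Whichever branch applies, the returning node never decides on its own to run as primary.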

3 OS crash primary (reboot fails)

Detect

Node loss.

Sync evaluation

Secondary SR state checked.

Pacemaker action

Same as #1.

Auto failover

Yes, if safe.

Risk of Data Loss

Same mechanics as hardware crash.
Possible small RPO in async mode.
No loss if fully synchronized at crash time.

4 OS crash primary (reboot possible)

Detect

Node loss then returns.

Pacemaker action

Same as #2.
Must prevent old primary auto-start outside cluster control.

Auto failover

Yes, if safe.

Risk of Data Loss

Same as #3.
Limited to replication lag window if async mode is used.

5 Storage failure primary site (one or more disks not readable)

Detect

HANA monitor reports failure (I/O errors, services stop)
Optional filesystem agents detect outages

Sync evaluation

Secondary might still be in sync, but corruption risk must be considered.

Pacemaker action

Stop or fence primary if unsafe
Promote secondary only if SR state is safe
If corruption suspected or SR unsafe, block promotion

Auto failover

Sometimes.

Risk of Data Loss

Elevated risk.
Possible partial data loss if redo not shipped.
Additional risk of logical corruption prior to crash.
In worst cases, restore from backup may be required.

6 Network failure, Primary Site not reachable (from HA site)

This is the classic partition case. In a standard two-node cluster with two_node=1, neither side technically "loses quorum" in the traditional sense; rather, the tie-breaking is handled by fencing. The side that survives the fencing race is the one that promotes.

Detect

Corosync partition.

Sync evaluation

HA side must not assume takeover without fencing certainty.

Pacemaker action

Fence the other side before promotion.
Fencing-first behavior critical.

Auto failover

Sometimes, only if fencing succeeds and SR state allows.

Risk of Data Loss

High risk if fencing fails or is bypassed.
Potential partial loss if takeover occurs without last logs.
Severe risk of data divergence if split brain develops.
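
The fencing race can be modelled in a few lines of Python. This is a sketch under simplifying assumptions: each node waits a configured delay before shooting, the first node that is actually able to fence its peer survives, and the function name and inputs are invented for illustration.

```python
from typing import Dict, Optional


def fencing_race(delays: Dict[str, float], can_fence: Dict[str, bool]) -> Optional[str]:
    """Return the surviving node, or None if nobody could fence."""
    # Nodes attempt to fence in order of their delay; the first successful shot wins.
    for node in sorted(delays, key=lambda name: delays[name]):
        if can_fence.get(node, False):
            return node
    # No node could fence its peer: promotion stays blocked.
    return None


delays = {"siteA": 0.0, "siteB": 15.0}
print(fencing_race(delays, {"siteA": True, "siteB": True}))
print(fencing_race(delays, {"siteA": False, "siteB": True}))
print(fencing_race(delays, {"siteA": False, "siteB": False}))
```

Giving the two nodes different delays avoids the case where both shoot at the same moment and take each other down. If nobody can fence, there is no survivor and therefore no promotion.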

7 Network failure, HA Site not reachable (from primary)

Detect

Same partition, seen from primary side.

Pacemaker action

Primary keeps running.
Prevent secondary from promoting.

Auto failover

Not applicable.

Risk of Data Loss

No immediate data loss risk.
Redundancy degraded only.

8 OS shutdown triggered on primary site

A clean OS shutdown without a prior resource move or standby command may be treated by the cluster as a failure rather than a planned move, unless the cluster is specifically configured for it.

Detect

Clean stop observed by monitors.

Sync evaluation

Normally treated as controlled switchover.

Pacemaker action

Move VIP and promote secondary before shutdown if planned.

Auto failover

Sometimes.

Risk of Data Loss

No loss if executed as planned switchover with verified sync.
Possible small RPO if treated as failure while not fully synchronized.

9 HANA crash on primary (crash dump)

Detect

SAPHana* monitor detects instance down.

Sync evaluation

Secondary SR status checked.

Pacemaker action

Attempt local restart first.
If restart fails, takeover if safe.

Auto failover

Sometimes.

Risk of Data Loss

Same exposure as crash scenarios.
Small RPO possible in async mode.
No loss if replay fully caught up.

10 HANA restart triggered on primary (SIGTERM, SIGKILL)

Operationally similar to #9, except it might be “cleaner” (SIGTERM) or abrupt (SIGKILL). Pacemaker will try restart; if not stable, it may fail over if safe.

Detect

Instance stops.

Pacemaker action

Restart resource.
If unstable, failover if safe.

Auto failover

Sometimes.

Risk of Data Loss

Possible if takeover occurs before replay completion.
Clean SIGTERM with synchronized SR typically safe.
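
The "restart first, fail over only if that does not help" behaviour of scenarios 9 and 10 can be sketched as a threshold decision. The sketch is loosely modelled on Pacemaker's migration-threshold idea (a resource is moved away after a number of local failures); the function name and return values are hypothetical, and real promotable resources involve more state than this.

```python
def recovery_action(fail_count: int, migration_threshold: int, secondary_safe: bool) -> str:
    """Pick the next recovery step after a failed HANA monitor on the primary."""
    if fail_count < migration_threshold:
        # Still below the threshold: try to recover in place.
        return "restart_locally"
    if secondary_safe:
        # Local recovery is exhausted and the secondary is a safe target.
        return "fail_over"
    # Local recovery is exhausted but takeover would be unsafe.
    return "block_and_alert"


for failures in (1, 2, 3):
    print(failures, recovery_action(failures, migration_threshold=3, secondary_safe=True))
```

A low threshold favours fast failover, a higher one favours local recovery. Either way the SR safety check still gates the takeover.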

11 HANA on HA site not responding

Detect

Secondary monitor fails.

Sync evaluation

No takeover because primary is up.

Pacemaker action

Keep primary running.
Attempt recovery of secondary.

Auto failover

No.

Risk of Data Loss

No data loss.
Availability intact.
Redundancy temporarily lost.

12 Primary/HA out of sync (log replay not working or falling behind)

Detect

SR status indicates not healthy/synchronized.

Pacemaker action

Block automatic takeover.
Alert operators.

Auto failover

No, by design.

Risk of Data Loss

Guaranteed data loss if manual forced takeover is executed.
Data loss equals replication lag at time of takeover.

13 Split brain at HANA level (both think they are primary)

Detect

Inconsistent SR state.
Impossible role combination.

Pacemaker action

Immediate fencing of one side.
Manual recovery and re-establishment of SR.

Auto failover

No.

Risk of Data Loss

Very high.
Data divergence possible.
One side’s committed transactions must be discarded.

14 Pacemaker/Fencing device/Quorum failure while HANA is healthy

Detect

Cluster services degraded.
STONITH errors.
Quorum subsystem issues.

Pacemaker action

Refuse dangerous actions.
Keep current primary running if possible.

Auto failover

No (in most sane policies).

Risk of Data Loss

No immediate risk.
Risk only arises if administrators force unsafe manual takeover.

15 Time synchronization failure

Detect

Chrony/NTP alarms.

Pacemaker action

Usually no immediate takeover.
May block cluster actions if severe.

Auto failover

Generally no.

Risk of Data Loss

No direct replication loss.
Indirect operational instability possible.

16 SAP HostAgent failure

Detect

Host Agent monitor fails.

Pacemaker action

Usually alert only.
HANA continues running.

Auto failover

Depends on design.

Risk of Data Loss

No direct data loss.
Impacts operational tooling rather than replication.

Summary

Failover sequence (happy path, safe SR)

Primary fails
Cluster detects node/resource failure
Cluster fences old primary (or confirms dead)
Cluster promotes secondary to primary
VIP moves to new primary
ABAP app servers reconnect to same endpoint (VIP)
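
The important property of this sequence is its ordering: a later step must never run if an earlier one failed. A minimal Python sketch of that idea, with step names and callables that are placeholders rather than real cluster operations:

```python
from typing import Callable, List, Tuple

Step = Tuple[str, Callable[[], bool]]


def run_failover(steps: List[Step]) -> List[str]:
    """Run the steps in order and stop at the first one that fails."""
    log = []
    for name, action in steps:
        if not action():
            log.append(f"{name}: FAILED, sequence stopped")
            break
        log.append(f"{name}: ok")
    return log


steps = [
    ("detect failure", lambda: True),
    ("fence old primary", lambda: False),  # simulate a fencing failure
    ("promote secondary", lambda: True),
    ("move VIP", lambda: True),
]
for line in run_failover(steps):
    print(line)
```

With the simulated fencing failure, promotion and the VIP move are never attempted, which is exactly the fencing-first behaviour the design relies on.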

Where automatic failover is possible vs not

Automatic failover typically possible (if SR is safe and fencing works)

1, 2, 3, 4
6 (only if fencing succeeds and policy allows)
8, 9, 10 (depends on restart policy and SR safety)
16 (depends on design; usually alert only, HANA keeps running)

Automatic failover typically NOT possible (blocked by safety)

12 out of sync (explicit safety rule)
13 split brain (must fence and manually recover)
14 cluster safety infrastructure broken (quorum/fencing)
Often 5 if corruption risk exists

Usually no failover needed (service continues)

7 standby unreachable (primary still serves)
11 standby HANA not responding (redundancy degraded)

Manual failover required: when and steps

When manual failover is required (common cases)

12 out-of-sync SR but primary is dead or unstable and business needs recovery
5 storage failure where integrity is uncertain
13 split brain recovery
14 fencing/quorum broken but you must move service

Manual failover steps (safe generic runbook)

Stop business access to DB endpoint (freeze app tier if possible).
Confirm primary is truly down or isolated.
Fence old primary (or otherwise guarantee it cannot run HANA primary anymore).
On the secondary, validate HANA SR status and decide:
- If acceptable, execute HANA takeover (accepting possible data loss if not in sync).
- If not acceptable, do not take over; recover the primary or restore from backup instead.
In Pacemaker, ensure the promotable resource is promoted on the intended node and VIP follows.
Re-register the old site as secondary after it is repaired, then resync.

(Exact commands differ by deployment tooling, but the safety sequence above is the key.)

Scenarios where neither auto nor manual failover is possible

These are rare, but real:

Both sites down, or secondary storage also broken
Replication is broken and the last consistent copy is unknown
Severe split brain with uncertain data consistency and no trustworthy primary

In these cases, failover is not the solution.

When restore from backup is required

Most commonly:

Primary storage failure plus suspected corruption and secondary is not a trustworthy consistent target (5 + unsafe SR)
Split brain with data divergence where you cannot guarantee correctness (13)
Prolonged out-of-sync where logs are missing and secondary cannot be promoted without unacceptable loss (12 in worst-case)

When Pacemaker cannot reliably detect “in sync”

Pacemaker can only act on what the SR interfaces and monitoring provide. It can lose certainty when:

Monitoring path is broken (cluster can’t query SR state)
Time sync problems cause false timeouts (15)
HostAgent or SAP control interfaces used by the agent are unavailable (16, depending on setup)
Quorum/STONITH is degraded (14), so even if SR looks good, the cluster cannot guarantee single-primary enforcement

What then?

Safe designs default to blocking promotion until:

fencing certainty is restored, and
SR health can be validated again
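
"Blocking until certainty is restored" amounts to fail-closed logic: a missing or stale SR reading is treated as unknown, and unknown never permits promotion. The Python sketch below illustrates this with invented names and an arbitrary freshness limit; it is not how any specific resource agent tracks its data.

```python
import time
from typing import Optional


def sr_verdict(in_sync: Optional[bool], last_update: float, max_age_s: float,
               now: Optional[float] = None) -> str:
    """Classify the last SR reading as 'safe', 'unsafe' or 'unknown'."""
    now = time.time() if now is None else now
    # No reading at all, or a reading older than the allowed age, is unknown.
    if in_sync is None or now - last_update > max_age_s:
        return "unknown"
    return "safe" if in_sync else "unsafe"


def may_promote(verdict: str, fencing_ok: bool) -> bool:
    """Fail closed: promote only on a fresh 'safe' verdict with working fencing."""
    return verdict == "safe" and fencing_ok


fresh = sr_verdict(True, last_update=100.0, max_age_s=30.0, now=110.0)
stale = sr_verdict(True, last_update=100.0, max_age_s=30.0, now=200.0)
print(fresh, may_promote(fresh, fencing_ok=True))
print(stale, may_promote(stale, fencing_ok=True))
```

The stale reading says "in sync", but because it can no longer be trusted the sketch refuses promotion, the same conservative default described above.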

Documentation

Further links and documentation are listed below:

SAP Notes

SAP Internal

https://wiki.one.int.sap/wiki/spaces/HECOPS/pages/3363531559/Pacemaker+in+ECS
