RHEL Linux Pacemaker Cluster

Building a Production-Grade Apache High Availability Cluster on RHEL

High availability on Linux is not about “keeping a service running” — it’s about coordinated failure handling, data integrity protection, deterministic recovery, and predictable behavior under stress.

On RHEL, Pacemaker + Corosync form Red Hat’s supported HA stack. This article goes beyond basic setup and explains how the cluster actually works, how to tune it, and how to avoid the classic mistakes that cause split-brain or endless failover loops.
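Before digging into internals, it helps to see the typical bootstrap. A sketch for RHEL 8/9 with the HighAvailability repository enabled; the hostnames node1/node2 and the hacluster password are placeholders:

```shell
# Run on every node: install the HA stack and start the pcs daemon.
dnf install -y pcs pacemaker fence-agents-all
systemctl enable --now pcsd
passwd hacluster                               # same password on all nodes

# Run once, from any node (pcs 0.10+ syntax):
pcs host auth node1 node2 -u hacluster         # authenticate nodes
pcs cluster setup apache-cluster node1 node2   # generates corosync.conf
pcs cluster start --all
```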

Architecture Overview
┌─────────────┐
│   Corosync  │  ← Cluster messaging, quorum, membership
│  (Totem)    │
└──────┬──────┘
       │
┌──────▼──────┐
│  Pacemaker  │  ← Resource orchestration & decision engine
│  (CRMd)     │
└──────┬──────┘
       │
┌──────▼──────┐
│  Resource   │  ← OCF / systemd agents
│   Agents    │
└─────────────┘

Responsibility
Corosync
  • Reliable multicast messaging
  • Node membership
  • Quorum calculation
  • Split-brain prevention
Pacemaker
  • Resource placement decisions
  • Failure scoring
  • Recovery orchestration
  • Fencing enforcement
Pacemaker does nothing without Corosync. Corosync does not manage resources.

Key Pacemaker Internals
  • Designated Coordinator (DC)
  • Exactly one DC per cluster
  • Elected dynamically
  • Owns the authoritative CIB
  • All cluster decisions flow through the DC
  • DC changes are normal — frequent DC flapping is not.
Cluster Information Base (CIB)
XML-based configuration and runtime state
Stored in memory, synced across nodes
Sections:
Configuration (resources, constraints)
Status (failures, node state)
Options (timeouts, quorum policy)

View raw CIB:
# pcs cluster cib
Edit safely (dump to a file, edit the copy, then push it back in one step):
# pcs cluster cib /tmp/cib.xml
# vi /tmp/cib.xml
# pcs cluster cib-push /tmp/cib.xml

Local Resource Manager (LRMd)
Runs on every node
Executes resource agent actions:
  • start
  • stop
  • monitor
  • promote/demote (for promotable clone resources)
LRMd reports results back to CRMd → DC.

Corosync: Totem Protocol
  • UDP multicast or unicast
  • Token-based membership
  • Ordered, reliable delivery
  • Heartbeat + failure detection
Ports (default):
UDP 5404–5405
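On RHEL, firewalld ships a predefined high-availability service that opens these Corosync ports plus pcsd (TCP 2224) and related traffic, so there is no need to open ports one by one:

```shell
# Run on every node: open all HA-stack ports via the predefined service.
firewall-cmd --permanent --add-service=high-availability
firewall-cmd --reload
```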

Corosync Configuration File
The /etc/corosync/corosync.conf file defines the cluster's communication layer for Pacemaker. It uses the Totem protocol for reliable multicast messaging across nodes, with redundancy via multiple rings.

Complete Sample Configuration
For a 2-node Apache HA cluster (apache-cluster), use this tuned configuration. Save it identically on all nodes before starting the cluster.

totem {
    version: 2
    cluster_name: apache-cluster
    config_version: 2  # Increment on config changes
    secauth: on        # Enables authentication (recommended)
    # token: 1000       # Heartbeat timeout (ms); default 1000, increase for slow networks
    # consensus: 1200   # Quorum agreement timeout (ms); default 1200
    # join: 60          # Time to wait for new nodes (s)
    # max_messages: 20  # Messages per period
    interface {
        ringnumber: 0
        bindnetaddr: 192.168.1.0  # Network subnet (e.g., your cluster net)
        mcastaddr: 226.94.1.100   # Multicast address (unique per cluster)
        mcastport: 5405
        ttl: 1
    }
}

logging {
    fileline: off
    to_stderr: no
    to_logfile: yes
    to_syslog: yes
    logfile: /var/log/cluster/corosync.log
    debug: off
    timestamp: on
    logger_subsys {
        subsys: QUORUM
        debug: off
    }
}

nodelist {
    node {
        ring0_addr: node1.example.com  # FQDN or IP of node1
        nodeid: 1
        quorum_votes: 1
    }
    node {
        ring0_addr: node2.example.com  # FQDN or IP of node2
        nodeid: 2
        quorum_votes: 1
    }
}

quorum {
    provider: corosync_votequorum
    two_node: 1  # Allows quorum with a single vote (implies wait_for_all)
    # wait_for_all: 1  # Wait for all nodes before granting first quorum
}

amf {
    mode: disabled  # Application Management Framework (not needed for basic HA)
}

Key Parameters Explained
Section          Parameter     Purpose                                  Recommended Value
totem            secauth       Message signing/authentication           on (security)
totem            cluster_name  Unique cluster identifier                Matches pcs cluster name
totem.interface  bindnetaddr   Cluster network subnet                   e.g., 192.168.1.0
totem.interface  mcastaddr     Multicast group (unique per cluster)     226.94.1.100
nodelist.node    ring0_addr    Node's cluster IP/hostname               Resolvable via /etc/hosts
nodelist.node    nodeid        Unique numeric ID (1-N)                  Sequential integers
nodelist.node    quorum_votes  Votes for quorum calculation             1 per node
quorum           two_node      Allows 2-node clusters                   1 (critical for pairs)

Tuning note:
Lower token values = faster failover, but higher false-positive risk on noisy networks.

Quorum: The Non-Negotiable Rule
No quorum = no resources
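Quorum is a strict majority of the expected votes. The arithmetic is simple enough to sketch:

```shell
# Quorum = floor(expected_votes / 2) + 1 (strict majority).
expected_votes=3
quorum=$(( expected_votes / 2 + 1 ))
echo "$quorum"   # a 3-node cluster stays quorate with 2 votes
```

This is why an even node count is awkward: four nodes still need three votes, so a 2/2 split loses quorum on both sides.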

Default behavior:
# pcs property show no-quorum-policy

Values:
stop (default)
freeze
ignore (only for 2-node with fencing or qdevice)
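Setting the policy explicitly documents your intent in the CIB; for example:

```shell
# Keep the safe default: stop all resources when quorum is lost.
# (freeze keeps running resources but blocks new actions.)
pcs property set no-quorum-policy=stop
```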

Why 2-Node Clusters Are Dangerous
50/50 split possible
Both nodes think the other is dead
Data corruption without fencing

Correct 2-Node Setup
Enable STONITH
Add qdevice (tie-breaker)
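A hedged sketch of adding a qdevice tie-breaker on a third host outside the cluster; the arbiter hostname qnode is a placeholder:

```shell
# On the arbiter host (qnode, NOT a cluster member):
dnf install -y pcs corosync-qnetd
systemctl enable --now pcsd
pcs qdevice setup model net --enable --start

# On every cluster node:
dnf install -y corosync-qdevice

# From one cluster node: register the arbiter as a tie-breaker.
pcs quorum device add model net host=qnode algorithm=ffsplit
pcs quorum status
```

The ffsplit algorithm gives the vote to exactly one partition in a 50/50 split, which is precisely the 2-node failure mode described above.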

Fencing (STONITH): Why It’s Mandatory
If the cluster can’t power off a failed node, it cannot guarantee data integrity
Pacemaker will refuse to start resources without fencing in production-grade configs.
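That enforcement is governed by the stonith-enabled cluster property; leaving it at the default is the production stance:

```shell
# Keep fencing enforcement on (the default); never disable in production.
pcs property set stonith-enabled=true
pcs property show stonith-enabled
```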

Fence Agent Categories
  • Power-based: IPMI, iDRAC, iLO
  • Hypervisor-based: fence_vmware, fence_rhevm
  • Network-based: fence_switch
Example (IPMI):
# pcs stonith create fence_node1 fence_ipmilan \
  pcmk_host_list=node1 \
  ip=192.168.1.100 \
  username=admin password=secret \
  lanplus=1

Verify:
# pcs stonith show
# pcs stonith fence node1

Apache HA Design
Clients
   |
[ Virtual IP ]
   |
ApacheGroup
 ├── IPaddr2
 └── httpd

Why Grouping Matters
  • Ensures start/stop order
  • Guarantees co-location
  • Simplifies constraints
Resource Agent Types
Type           Use Case
ocf:heartbeat  Portable, HA-aware agents
systemd:       Systemd units
stonith:       Fencing devices
Prefer OCF agents where possible — they expose richer monitoring.

Resource Creation (With Timeouts)
# pcs resource create VirtualIP ocf:heartbeat:IPaddr2 \
  ip=192.168.1.50 cidr_netmask=24 \
  op monitor interval=20s timeout=30s

# pcs resource create WebServer systemd:httpd \
  op start timeout=90s \
  op stop timeout=90s \
  op monitor interval=30s timeout=60s
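With both resources defined, the ApacheGroup from the design diagram ties them together. A group gives you implicit colocation plus ordered start (listed order) and reverse-ordered stop:

```shell
# VirtualIP starts first, WebServer second; stop happens in reverse.
# Both are kept on the same node automatically.
pcs resource group add ApacheGroup VirtualIP WebServer
pcs status resources
```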

Failure Handling & Scoring
Pacemaker tracks a per-resource, per-node failcount:
Resource fails → failcount increments on that node
failcount reaches migration-threshold → node is banned for that resource → resource moved
View scores:
# pcs resource failcount show WebServer
Clear after fixing issue:
# pcs resource cleanup WebServer
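The threshold and the automatic expiry of old failures are per-resource meta-attributes; a sketch:

```shell
# Move WebServer off a node after 3 failures; forget recorded
# failures after 2 minutes so the node becomes eligible again.
pcs resource meta WebServer migration-threshold=3 failure-timeout=120s
```

Without failure-timeout, failcounts persist until a manual cleanup, so a node can stay banned long after the underlying fault is fixed.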

Constraints
Colocation
# pcs constraint colocation add WebServer with VirtualIP INFINITY
Ordering
# pcs constraint order VirtualIP then WebServer
Location Bias
# pcs constraint location WebServer prefers node1=100
Negative scoring:
# pcs constraint location WebServer avoids node3=INFINITY

Monitoring & Debugging
Live Cluster View
# crm_mon -Arf

Pacemaker Logs
# journalctl -u pacemaker -f
# journalctl -u corosync -f

Common Debug Commands
# pcs status --full
# pcs resource debug-start WebServer
# crm_verify -L -V

Failover Testing
Service Failure
# systemctl stop httpd
Node Eviction
# pcs node standby node1
Fence Test
# pcs stonith fence node1
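While running these tests, polling the virtual IP from a client machine makes the failover window visible. This assumes the VIP 192.168.1.50 from the earlier resource example:

```shell
# From a client: poll the VIP once a second; gaps or DOWN lines
# show the failover window during each test.
while true; do
  curl -s -o /dev/null -m 2 -w '%{http_code}\n' http://192.168.1.50/ \
    || echo 'DOWN'
  sleep 1
done
```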

Performance & Failover Tuning
Setting            Effect
monitor interval   Detection speed
token              Corosync sensitivity
op timeout         Avoid false failures
failure-timeout    Auto recovery

Example tuning:
# pcs resource op defaults timeout=90s
# pcs resource defaults failure-timeout=120s

Scaling the Architecture

Shared Content Options
NFS (simple, SPOF unless HA)
GFS2 (clustered FS)
DRBD (block replication)
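For the NFS option, the shared docroot becomes one more cluster resource that must be mounted before httpd starts. A sketch; the export nfs-server:/export/www is a placeholder:

```shell
# Mount the shared docroot on whichever node runs Apache.
pcs resource create WebFS ocf:heartbeat:Filesystem \
  device=nfs-server:/export/www directory=/var/www/html fstype=nfs \
  op monitor interval=20s timeout=40s

# Keep the mount with Apache and start it first.
pcs constraint colocation add WebFS with WebServer INFINITY
pcs constraint order WebFS then WebServer
```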

Multi-Site
Anti-colocation rules
geo-clusters (advanced)
Application-level sync

Common Production Issues (And Root Causes)
  • Corosync config out of sync: redistribute corosync.conf with pcs cluster sync, then restart cluster services on the stale node.
  • Resource stuck in a failed state: fix the root cause, then pcs resource cleanup WebServer.
  • DC election loop: verify the ring0_addr entries in corosync.conf and confirm time sync (chrony/ntp) on all nodes.
  • SELinux denials: ausearch -m avc -ts recent | audit2allow to identify the denial and build a policy module.
  • Slow starts flagged as failures: raise operation timeouts, e.g. pcs resource op defaults timeout=90s.
Best Practices Checklist
  • Use 3+ nodes or qdevice
  • Always enable STONITH
  • Tune timeouts conservatively
  • Test fencing quarterly
  • Monitor failcounts
  • Document constraints
  • Never ignore quorum casually
Final Thoughts
Pacemaker and Corosync are not just HA tools — they’re distributed systems with strong opinions about safety.
If you:
  • Respect quorum
  • Configure fencing correctly
  • Tune failure detection
  • Test realistically
You’ll get predictable, fast, and safe failover.
