
AIX PowerHA (HACMP) Cluster Overview

High availability (HA) is essential in enterprise AIX environments, where downtime translates to significant business losses. IBM PowerHA SystemMirror, formerly known as HACMP (High Availability Cluster Multi-Processing), is IBM’s flagship clustering solution for AIX running on Power Systems.

It ensures near-continuous application uptime by automating failure detection, failover, and recovery — making it a cornerstone of mission-critical AIX deployments.

PowerHA SystemMirror:
PowerHA SystemMirror provides a robust framework for high availability and disaster recovery in AIX environments. It integrates deeply with Cluster Aware AIX (CAA) and Reliable Scalable Cluster Technology (RSCT), creating an intelligent cluster that can detect failures, reassign resources, and recover services automatically.

With PowerHA, applications continue running seamlessly, even during hardware, network, or node failures.

Core Architecture and Components:
At its foundation, PowerHA clusters consist of nodes, networks, shared storage, and resources coordinated by a suite of daemons and management utilities.

Cluster Nodes:
  • Each node runs AIX and participates in the cluster.
  • Supports up to 32 nodes for large-scale environments.
Cluster Networks:
  • Internal (heartbeat) network for node-to-node communication.
  • Service network for client access using floating service IPs.
Example:
Node1_Boot: 192.168.10.101
Node2_Boot: 192.168.10.102
Service IP: 192.168.10.201
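As a quick check, the boot and service addresses above can be inspected from either node (the second command assumes the CAA cluster has already been defined):

```shell
# List configured interfaces and their current addresses on this node
netstat -in

# Show the network interfaces CAA is monitoring for heartbeat
lscluster -i
```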

Shared Storage:
  • Repository Disk: Stores cluster configuration and locking data.
  • Application/NFS Disks: Shared via Enhanced Concurrent Volume Groups (ECVGs).
Example:
Repository Disk: hdisk2
Application VG:  nfs_vg
Logical Volume:  nfs_lv
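A minimal sketch of preparing the shared application storage; `hdisk3` and the logical-partition count are examples only, and in practice C-SPOC would propagate the volume group to the other nodes:

```shell
# Create an enhanced concurrent-capable volume group on the shared disk
mkvg -C -y nfs_vg hdisk3

# Create the logical volume inside it (10 LPs as an example size)
mklv -y nfs_lv nfs_vg 10

# Confirm disk-to-VG assignments, including the repository disk hdisk2
lspv
```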

PowerHA Cluster Daemons and Services
Daemon        Function
clstrmgrES    Cluster manager; maintains heartbeat, manages events, drives failover logic.
clcomdES      Handles node-to-node communication.
cllockd       Provides distributed resource locking.
gsclvmd       Manages Enhanced Concurrent Volume Groups (ECVGs).
clsmuxpd      Delivers cluster status monitoring services via SNMP.

These daemons ensure that the cluster maintains consistency, communicates efficiently, and executes failovers seamlessly.
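Daemon health can be checked through the AIX System Resource Controller; the `cluster` subsystem group is standard on PowerHA nodes:

```shell
# Show the state of all PowerHA subsystems in the cluster group
lssrc -g cluster

# Detailed long status of the cluster manager (node and RG state)
lssrc -ls clstrmgrES
```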

PowerHA Failover Process (Logical Flow)
  • Node1 hosts the active resources — applications, service IPs, and shared disks.
  • A heartbeat failure or node crash is detected by clstrmgrES via CAA.
  • Cluster daemons automatically relocate resources to Node2.
  • The Service IP and associated applications start on Node2.
  • Clients reconnect through the same Service IP after only a brief pause during takeover.
This automated recovery process ensures minimal downtime and data consistency.
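The same flow can be observed, or rehearsed as a planned move, with the commands below (`app_rg` and `node2` are example names):

```shell
# Show where each resource group is currently online
/usr/es/sbin/cluster/utilities/clRGinfo

# Manually move a resource group to the standby node (planned failover)
clmgr move resource_group app_rg NODE=node2
```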

PowerHA Startup & Failover Policies
Start cluster services
# clmgr online cluster

Set the startup, failover, and fallback policies on a resource group
(OHN = Online on Home Node, FNPN = Fallover to Next Priority Node, NFB = Never Fallback)
# clmgr modify resource_group app_rg STARTUP=OHN FALLOVER=FNPN FALLBACK=NFB

Stop cluster services gracefully
# clmgr offline cluster WHEN=now MANAGE=offline

Key Features of PowerHA 7.2
  • Supports up to 32-node clusters.
  • Full integration with CAA and RSCT frameworks.
  • Enhanced Concurrent Volume Groups (ECVGs) for shared disk access.
  • Flexible Startup, Failover, and Fallback policies.
  • Dynamic Automatic Reconfiguration (DARE) snapshots for live configuration capture.
  • Simplified management via C-SPOC (Cluster Single Point of Control).
Required Filesets
Ensure the following filesets are installed on all nodes:
  • cluster.es.client – Client components
  • cluster.es.server – Server components
  • cluster.es.cspoc – Cluster Single Point of Control
  • bos.clvm – Required for enhanced concurrent volume groups
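Installation can be confirmed with `lslpp`; `bos.clvm.enh` is the usual fileset delivering enhanced concurrent LVM support:

```shell
# Verify the PowerHA filesets are installed and committed on every node
lslpp -l "cluster.es.*"
lslpp -l bos.clvm.enh
```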
Cluster Awareness and Communication: CAA Integration
Cluster Aware AIX (CAA) is the kernel-level clustering infrastructure beneath PowerHA.
It handles:
  • Heartbeat monitoring
  • Repository disk access
  • Network and node failure detection
  • Cluster configuration synchronization
Key CAA daemons include:
  • clcomd – Communication handler
  • clconfd – Synchronizes configuration changes (~ every 10 minutes)
  • ctrmc – Monitors resources (part of RSCT)
  • clstrmgrES – PowerHA’s cluster manager daemon
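CAA state is visible through the `lscluster` command:

```shell
# Show cluster membership and the state of each node as seen by CAA
lscluster -m

# Show the CAA cluster configuration, including the repository disk
lscluster -c
```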

The CAA Repository Disk
The repository disk is the central coordination point for PowerHA clusters.
Key Facts:
  • Dedicated use only — cannot store application data.
  • Typical size: 512 MB – 10 GB
  • Managed exclusively by CAA (not standard LVM).
  • Ensures consistency across all cluster nodes.
  • Recommended: RAID and multipathing for redundancy.
The repository disk enables heartbeat persistence even if the network fails — ensuring continuous cluster integrity.
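The repository disk can be identified from any node; once configured, it appears under the CAA-private volume group `caavg_private`:

```shell
# List the cluster storage; the repository disk is flagged in the output
lscluster -d

# The repository disk is held by the CAA-managed caavg_private group
lspv | grep caavg_private
```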

Deadman Switch (DMS): Cluster Safety Mechanism
The Deadman Switch (DMS) protects cluster integrity by detecting hung or isolated nodes.
Modes:
  • Mode "a" (assert): Forces node crash to prevent split-brain.
  • Mode "e" (event): Triggers an AHAFS event for manual intervention.
By enforcing these safety protocols, PowerHA prevents data corruption during severe node/network failures.
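Assuming a CAA level that exposes the `deadman_mode` tunable through `clctrl` (worth verifying on your release), the mode can be inspected and changed like this:

```shell
# Display the current deadman switch mode
clctrl -tune -o deadman_mode

# Set assert mode so a hung node is forced down to avoid split-brain
clctrl -tune -o deadman_mode=a
```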

RSCT – Reliable Scalable Cluster Technology
RSCT is the backbone of PowerHA, providing monitoring, event handling, and system coordination.
Components:
  • RMC (Resource Monitoring and Control): Tracks cluster resources.
  • HAGS (Group Services): Handles cluster messaging and coordination.
  • HATS (Topology Services): Monitors heartbeat and detects failures.
  • SRC (System Resource Controller): Manages daemon processes.
RSCT organizes nodes into:
  • Peer Domains (operational clusters)
  • Management Domains (administrative supervision)
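The RSCT layer can be inspected with its own utilities:

```shell
# List RSCT peer domains and the nodes that belong to them
lsrpdomain
lsrpnode

# Check the RSCT subsystems running under SRC control
lssrc -a | grep -E "ctrmc|cthags|cthats"
```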

PowerHA Cluster Services
PowerHA relies on tightly integrated services to ensure continuous operation:
  • clstrmgrES: Main cluster manager
  • clevmgrdES: Manages shared LVM coordination
  • clinfoES: Provides monitoring and status info
  • RSCT and CAA daemons: Enable communication, health checks, and configuration sync
Together, these maintain a robust high-availability ecosystem.

Cluster Verification: clverify
Before deployment or after any configuration change, PowerHA uses clverify to check cluster consistency.
It detects:
  • Network misconfigurations
  • Volume group mismatches
  • Missing resources
Logs are stored at:
# cat /var/hacmp/clverify/clverify.log
Verification can be run via CLI or SMIT, ensuring a healthy cluster before going live.
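A typical verification pass from the command line looks like this (`clmgr sync` also runs verification before propagating changes):

```shell
# Verify the cluster definition without changing anything
clmgr verify cluster

# Verify and synchronize the configuration to all nodes in one step
clmgr sync cluster

# Review the results of the last verification run
tail -50 /var/hacmp/clverify/clverify.log
```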

C-SPOC (Cluster Single Point of Control)
C-SPOC simplifies administration by letting you manage the entire cluster from a single node.
Functions include:
  • Synchronizing configuration changes across all nodes
  • Managing volume groups and user accounts
  • Propagating commands securely via clcomd
This reduces complexity and ensures operational consistency.
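C-SPOC is usually driven through SMIT menus:

```shell
# Launch the C-SPOC menus from any node
smitty cspoc

# C-SPOC-aware utilities are installed under the cluster tree
ls /usr/es/sbin/cluster/cspoc
```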

Application Server & Monitor
Application Server: Hosts the clustered application.
Application Monitor: Ensures service health via two methods:
  • Process Monitoring: Tracks app processes through RSCT.
  • Custom Monitoring: Uses scripts to validate service functionality.
If an application fails, PowerHA can restart it locally or fail it over to another node.
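A custom monitor is simply a script that exits 0 while the application is healthy; this hypothetical example checks for an `nfsd` process (the process name is an assumption for illustration):

```shell
#!/bin/ksh
# Hypothetical custom application monitor. PowerHA invokes the script on an
# interval; exit 0 means healthy, non-zero reports the application as failed.
APP_PROC="nfsd"          # example process name only

if ps -eo comm | grep -w "$APP_PROC" >/dev/null 2>&1; then
    exit 0               # process found: application healthy
else
    exit 1               # not found: PowerHA restarts locally or fails over
fi
```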

DARE (Dynamic Automatic Reconfiguration) Snapshot
DARE snapshots capture complete cluster configurations live, allowing rollback or restoration.
  • Stored in /usr/es/sbin/cluster/snapshots
  • Used for troubleshooting, change rollback, or migration
  • Works without stopping the cluster
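A snapshot can be taken live with `clmgr` (the snapshot name and description below are examples):

```shell
# Capture a live cluster snapshot before making changes
clmgr add snapshot pre_change_snap DESCRIPTION="before network change"

# Snapshots are written to the standard snapshot directory
ls /usr/es/sbin/cluster/snapshots
```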
Network Topology, Persistent IPs, and Service IPs
Persistent Node IPs:
Static IPs for administrative access; remain on the same node.
Service IPs:
Floating IPs tied to resource groups; move automatically during failover.

Redundant networks, multiple NICs, and heartbeat links provide fault tolerance and seamless failover.
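Both IP types are defined per cluster network; the addresses and the `net_ether_01` network name below are examples, and the exact attribute names may vary slightly between releases:

```shell
# Define a persistent (node-bound) IP for administrative access
clmgr add persistent_ip 192.168.10.111 NETWORK=net_ether_01 NODE=node1

# Define the floating service IP that follows the resource group
clmgr add service_ip 192.168.10.201 NETWORK=net_ether_01
```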

Logs and Diagnostics
Log File                             Description
/var/hacmp/log/clstrmgr.debug        Cluster manager debug logs
/var/hacmp/adm/cluster.log           General cluster events
/var/hacmp/clverify/clverify.log     Verification logs
/var/hacmp/log/cspoc.log             C-SPOC operations
/var/hacmp/adm/history/              Daily cluster activity
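During failover testing, the most useful habit is to follow the event log live:

```shell
# Watch cluster events in real time during a failover test
tail -f /var/hacmp/adm/cluster.log

# Scan the cluster manager debug log for recent errors
grep -i error /var/hacmp/log/clstrmgr.debug | tail
```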
