
AIX PowerHA (HACMP) Cluster Overview

High availability (HA) is essential in enterprise AIX environments, where downtime translates to significant business losses. IBM PowerHA SystemMirror, formerly known as HACMP (High Availability Cluster Multi-Processing), is IBM’s flagship clustering solution for AIX running on Power Systems.

It ensures near-continuous application uptime by automating failure detection, failover, and recovery — making it a cornerstone of mission-critical AIX deployments.

PowerHA SystemMirror:
PowerHA SystemMirror provides a robust framework for high availability and disaster recovery in AIX environments. It integrates deeply with Cluster Aware AIX (CAA) and Reliable Scalable Cluster Technology (RSCT), creating an intelligent cluster that can detect failures, reassign resources, and recover services automatically.

With PowerHA, applications continue running seamlessly, even during hardware, network, or node failures.

Core Architecture and Components:
At its foundation, PowerHA clusters consist of nodes, networks, shared storage, and resources coordinated by a suite of daemons and management utilities.

Cluster Nodes:
  • Each node runs AIX and participates in the cluster.
  • Supports up to 32 nodes for large-scale environments.
Cluster Networks:
  • Internal (heartbeat) network for node-to-node communication.
  • Service network for client access using floating service IPs.
Example:
Node1_Boot: 192.168.10.101
Node2_Boot: 192.168.10.102
Service IP: 192.168.10.201
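As a quick check, the boot and service addresses above can be inspected from either node (the second command assumes the CAA cluster has already been defined):

```shell
# List configured interfaces and their current addresses on this node
netstat -in

# Show the network interfaces CAA is monitoring for heartbeat
lscluster -i
```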

Shared Storage:
  • Repository Disk: Stores cluster configuration and locking data.
  • Application/NFS Disks: Shared via Enhanced Concurrent Volume Groups (ECVGs).
Example:
Repository Disk: hdisk2
Application VG:  nfs_vg
Logical Volume:  nfs_lv
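A minimal sketch of preparing the shared application storage; `hdisk3` and the logical-partition count are examples only, and in practice C-SPOC would propagate the volume group to the other nodes:

```shell
# Create an enhanced concurrent-capable volume group on the shared disk
mkvg -C -y nfs_vg hdisk3

# Create the logical volume inside it (10 LPs as an example size)
mklv -y nfs_lv nfs_vg 10

# Confirm disk-to-VG assignments, including the repository disk hdisk2
lspv
```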

PowerHA Cluster Daemons and Services
Daemon        Function
clstrmgrES    Cluster manager; maintains heartbeat, manages events, drives failover logic.
clcomdES      Handles node-to-node communication.
cllockd       Provides distributed resource locking.
gsclvmd       Manages Enhanced Concurrent Volume Groups (ECVGs).
clsmuxpd      Delivers cluster status monitoring services via SNMP.

These daemons ensure that the cluster maintains consistency, communicates efficiently, and executes failovers seamlessly.
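Daemon health can be checked through the AIX System Resource Controller; the `cluster` subsystem group is standard on PowerHA nodes:

```shell
# Show the state of all PowerHA subsystems in the cluster group
lssrc -g cluster

# Detailed long status of the cluster manager (node and RG state)
lssrc -ls clstrmgrES
```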

PowerHA Failover Process (Logical Flow)
  • Node1 hosts the active resources — applications, service IPs, and shared disks.
  • A heartbeat failure or node crash is detected by clstrmgrES via CAA.
  • Cluster daemons automatically relocate resources to Node2.
  • The Service IP and associated applications start on Node2.
  • Clients reconnect through the same Service IP after only a brief pause during takeover.
This automated recovery process ensures minimal downtime and data consistency.
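The same flow can be observed, or rehearsed as a planned move, with the commands below (`app_rg` and `node2` are example names):

```shell
# Show where each resource group is currently online
/usr/es/sbin/cluster/utilities/clRGinfo

# Manually move a resource group to the standby node (planned failover)
clmgr move resource_group app_rg NODE=node2
```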

PowerHA Startup & Failover Policies
Start cluster services
# clmgr online cluster

Set the startup, failover, and fallback policies on a resource group
(OHN = Online on Home Node, FNPN = Fallover to Next Priority Node, NFB = Never Fallback)
# clmgr modify resource_group app_rg STARTUP=OHN FALLOVER=FNPN FALLBACK=NFB

Stop cluster services gracefully
# clmgr offline cluster WHEN=now MANAGE=offline

Key Features of PowerHA 7.2
  • Supports up to 32-node clusters.
  • Full integration with CAA and RSCT frameworks.
  • Enhanced Concurrent Volume Groups (ECVGs) for shared disk access.
  • Flexible Startup, Failover, and Fallback policies.
  • Dynamic Automatic Reconfiguration (DARE) snapshots for live configuration capture.
  • Simplified management via C-SPOC (Cluster Single Point of Control).
Required Filesets
Ensure the following filesets are installed on all nodes:
  • cluster.es.client – Client components
  • cluster.es.server – Server components
  • cluster.es.cspoc – Cluster Single Point of Control
  • bos.clvm – Required for enhanced concurrent volume groups
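Installation can be confirmed with `lslpp`; `bos.clvm.enh` is the usual fileset delivering enhanced concurrent LVM support:

```shell
# Verify the PowerHA filesets are installed and committed on every node
lslpp -l "cluster.es.*"
lslpp -l bos.clvm.enh
```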
Cluster Awareness and Communication: CAA Integration
Cluster Aware AIX (CAA) is the kernel-level clustering infrastructure beneath PowerHA.
It handles:
  • Heartbeat monitoring
  • Repository disk access
  • Network and node failure detection
  • Cluster configuration synchronization
Key CAA daemons include:
  • clcomd – Communication handler
  • clconfd – Synchronizes configuration changes (~ every 10 minutes)
  • ctrmc – Monitors resources (part of RSCT)
  • clstrmgrES – PowerHA’s cluster manager daemon
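CAA state is visible through the `lscluster` command:

```shell
# Show cluster membership and the state of each node as seen by CAA
lscluster -m

# Show the CAA cluster configuration, including the repository disk
lscluster -c
```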

The CAA Repository Disk
The repository disk is the central coordination point for PowerHA clusters.
Key Facts:
  • Dedicated use only — cannot store application data.
  • Typical size: 512 MB – 10 GB
  • Managed exclusively by CAA (not standard LVM).
  • Ensures consistency across all cluster nodes.
  • Recommended: RAID and multipathing for redundancy.
The repository disk enables heartbeat persistence even if the network fails — ensuring continuous cluster integrity.
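The repository disk can be identified from any node; once configured, it appears under the CAA-private volume group `caavg_private`:

```shell
# List the cluster storage; the repository disk is flagged in the output
lscluster -d

# The repository disk is held by the CAA-managed caavg_private group
lspv | grep caavg_private
```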

Deadman Switch (DMS): Cluster Safety Mechanism
The Deadman Switch (DMS) protects cluster integrity by detecting hung or isolated nodes.
Modes:
  • Mode "a" (assert): Forces node crash to prevent split-brain.
  • Mode "e" (event): Triggers an AHAFS event for manual intervention.
By enforcing these safety protocols, PowerHA prevents data corruption during severe node/network failures.
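Assuming a CAA level that exposes the `deadman_mode` tunable through `clctrl` (worth verifying on your release), the mode can be inspected and changed like this:

```shell
# Display the current deadman switch mode
clctrl -tune -o deadman_mode

# Set assert mode so a hung node is forced down to avoid split-brain
clctrl -tune -o deadman_mode=a
```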

RSCT – Reliable Scalable Cluster Technology
RSCT is the backbone of PowerHA, providing monitoring, event handling, and system coordination.
Components:
  • RMC (Resource Monitoring and Control): Tracks cluster resources.
  • HAGS (Group Services): Handles cluster messaging and coordination.
  • HATS (Topology Services): Monitors heartbeat and detects failures.
  • SRC (System Resource Controller): Manages daemon processes.
RSCT organizes nodes into:
  • Peer Domains (operational clusters)
  • Management Domains (administrative supervision)
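The RSCT layer can be inspected with its own utilities:

```shell
# List RSCT peer domains and the nodes that belong to them
lsrpdomain
lsrpnode

# Check the RSCT subsystems running under SRC control
lssrc -a | grep -E "ctrmc|cthags|cthats"
```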

PowerHA Cluster Services
PowerHA relies on tightly integrated services to ensure continuous operation:
  • clstrmgrES: Main cluster manager
  • clevmgrdES: Manages shared LVM coordination
  • clinfoES: Provides monitoring and status info
  • RSCT and CAA daemons: Enable communication, health checks, and configuration sync
Together, these maintain a robust high-availability ecosystem.

Cluster Verification: clverify
Before deployment or after any configuration change, PowerHA uses clverify to check cluster consistency.
It detects:
  • Network misconfigurations
  • Volume group mismatches
  • Missing resources
Logs are stored at:
# cat /var/hacmp/clverify/clverify.log
Verification can be run via CLI or SMIT, ensuring a healthy cluster before going live.
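A typical verification pass from the command line looks like this (`clmgr sync` also runs verification before propagating changes):

```shell
# Verify the cluster definition without changing anything
clmgr verify cluster

# Verify and synchronize the configuration to all nodes in one step
clmgr sync cluster

# Review the results of the last verification run
tail -50 /var/hacmp/clverify/clverify.log
```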

C-SPOC (Cluster Single Point of Control)
C-SPOC simplifies administration by letting you manage the entire cluster from a single node.
Functions include:
  • Synchronizing configuration changes across all nodes
  • Managing volume groups and user accounts
  • Propagating commands securely via clcomd
This reduces complexity and ensures operational consistency.
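C-SPOC is usually driven through SMIT menus:

```shell
# Launch the C-SPOC menus from any node
smitty cspoc

# C-SPOC-aware utilities are installed under the cluster tree
ls /usr/es/sbin/cluster/cspoc
```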

Application Server & Monitor
Application Server: Hosts the clustered application.
Application Monitor: Ensures service health via two methods:
  • Process Monitoring: Tracks app processes through RSCT.
  • Custom Monitoring: Uses scripts to validate service functionality.
If an application fails, PowerHA can restart it locally or fail it over to another node.
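A custom monitor is simply a script that exits 0 while the application is healthy; this hypothetical example checks for an `nfsd` process (the process name is an assumption for illustration):

```shell
#!/bin/ksh
# Hypothetical custom application monitor. PowerHA invokes the script on an
# interval; exit 0 means healthy, non-zero reports the application as failed.
APP_PROC="nfsd"          # example process name only

if ps -eo comm | grep -w "$APP_PROC" >/dev/null 2>&1; then
    exit 0               # process found: application healthy
else
    exit 1               # not found: PowerHA restarts locally or fails over
fi
```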

DARE (Dynamic Automatic Reconfiguration) Snapshot
DARE snapshots capture complete cluster configurations live, allowing rollback or restoration.
  • Stored in /usr/es/sbin/cluster/snapshots
  • Used for troubleshooting, change rollback, or migration
  • Works without stopping the cluster
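A snapshot can be taken live with `clmgr` (the snapshot name and description below are examples):

```shell
# Capture a live cluster snapshot before making changes
clmgr add snapshot pre_change_snap DESCRIPTION="before network change"

# Snapshots are written to the standard snapshot directory
ls /usr/es/sbin/cluster/snapshots
```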
Network Topology, Persistent IPs, and Service IPs
Persistent Node IPs:
Static IPs for administrative access; remain on the same node.
Service IPs:
Floating IPs tied to resource groups; move automatically during failover.

Redundant networks, multiple NICs, and heartbeat links provide fault tolerance and seamless failover.
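Both IP types are defined per cluster network; the addresses and the `net_ether_01` network name below are examples, and the exact attribute names may vary slightly between releases:

```shell
# Define a persistent (node-bound) IP for administrative access
clmgr add persistent_ip 192.168.10.111 NETWORK=net_ether_01 NODE=node1

# Define the floating service IP that follows the resource group
clmgr add service_ip 192.168.10.201 NETWORK=net_ether_01
```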

Logs and Diagnostics
Log File                             Description
/var/hacmp/log/clstrmgr.debug        Cluster manager debug logs
/var/hacmp/adm/cluster.log           General cluster events
/var/hacmp/clverify/clverify.log     Verification logs
/var/hacmp/log/cspoc.log             C-SPOC operations
/var/hacmp/adm/history/              Daily cluster activity
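During failover testing, the most useful habit is to follow the event log live:

```shell
# Watch cluster events in real time during a failover test
tail -f /var/hacmp/adm/cluster.log

# Scan the cluster manager debug log for recent errors
grep -i error /var/hacmp/log/clstrmgr.debug | tail
```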
