AIX PowerHA(HACMP) Cluster Overview

High availability (HA) is essential in enterprise AIX environments, where downtime translates to significant business losses. IBM PowerHA SystemMirror, formerly known as HACMP (High Availability Cluster Multi-Processing), is IBM’s flagship clustering solution for AIX running on Power Systems.

It ensures near-continuous application uptime by automating failure detection, failover, and recovery — making it a cornerstone of mission-critical AIX deployments.

PowerHA SystemMirror:
PowerHA SystemMirror provides a robust framework for high availability and disaster recovery in AIX environments. It integrates deeply with Cluster Aware AIX (CAA) and Reliable Scalable Cluster Technology (RSCT), creating an intelligent cluster that can detect failures, reassign resources, and recover services automatically.

With PowerHA, applications continue running seamlessly, even during hardware, network, or node failures.

Core Architecture and Components:
At its foundation, PowerHA clusters consist of nodes, networks, shared storage, and resources coordinated by a suite of daemons and management utilities.

Cluster Nodes:
  • Each node runs AIX and participates in the cluster.
  • Supports up to 32 nodes for large-scale environments.
Cluster Networks:
  • Internal (heartbeat) network for node-to-node communication.
  • Service network for client access using floating service IPs.
Example:
Node1_Boot: 192.168.10.101
Node2_Boot: 192.168.10.102
Service IP: 192.168.10.201

Shared Storage:
  • Repository Disk: Stores cluster configuration and locking data.
  • Application/NFS Disks: Shared via Enhanced Concurrent Volume Groups (ECVGs).
Example:
Repository Disk: hdisk2
Application VG:  nfs_vg
Logical Volume:  nfs_lv

PowerHA Cluster Daemons and Services
Daemon                            Function
clstrmgrES   Cluster manager; maintains heartbeat, manages events, and drives failover logic.
clcomdES   Handles node-to-node communication.
cllockd     Provides distributed resource locking.
gsclvmd    Manages Enhanced Concurrent Volume Groups (ECVGs).
clsmuxpd   Delivers cluster status monitoring services.

These daemons ensure that the cluster maintains consistency, communicates efficiently, and executes failovers seamlessly.

PowerHA Failover Process (Logical Flow)
  • Node1 hosts the active resources — applications, service IPs, and shared disks.
  • A heartbeat failure or node crash is detected by clstrmgrES via CAA.
  • Cluster daemons automatically relocate resources to Node2.
  • The Service IP and associated applications start on Node2.
  • Clients continue to connect without service interruption.
This automated recovery process ensures minimal downtime and data consistency.
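During or after a fallover, the cluster and resource-group state can be checked from the command line. A minimal sketch, assuming a resource group named app_rg (the name is illustrative; clRGinfo, lssrc, and lscluster are the standard PowerHA/CAA status tools):

```shell
RG=app_rg   # hypothetical resource group name
/usr/es/sbin/cluster/utilities/clRGinfo "$RG"   # shows which node hosts this RG
lssrc -ls clstrmgrES | grep -i state            # cluster manager state (e.g., ST_STABLE)
lscluster -m                                    # CAA's view of node and interface health
```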

PowerHA Startup & Failover Policies
Start cluster services (resource groups come online according to their startup policy):
# clmgr online cluster
Set the resource group policies — start on home node only, fall over to the next priority node, never fall back:
# clmgr modify resource_group app_rg STARTUP=OHN FALLOVER=FNPN FALLBACK=NFB
Stop cluster services gracefully (resource groups are brought offline):
# clmgr offline cluster MANAGE=offline

Key Features of PowerHA 7.2
  • Supports up to 32-node clusters.
  • Full integration with CAA and RSCT frameworks.
  • Enhanced Concurrent Volume Groups (ECVGs) for shared disk access.
  • Flexible Startup, Failover, and Fallback policies.
  • Dynamic Automatic Reconfiguration (DARE) snapshots for live configuration capture.
  • Simplified management via C-SPOC (Cluster Single Point of Control).
Required Filesets
Ensure the following filesets are installed on all nodes:
  • cluster.es.client – Client components
  • cluster.es.server – Server components
  • cluster.es.cspoc – Cluster Single Point of Control
  • bos.clvm – Required for enhanced concurrent volume groups
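The fileset list above can be checked on each node with lslpp; a quick sketch:

```shell
# Verify the required PowerHA filesets are installed on this node
FILESETS="cluster.es.client cluster.es.server cluster.es.cspoc bos.clvm"
for f in $FILESETS; do
    lslpp -l "$f" >/dev/null 2>&1 || echo "MISSING: $f"
done
```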
Cluster Awareness and Communication: CAA Integration
Cluster Aware AIX (CAA) is the kernel-level clustering infrastructure beneath PowerHA.
It handles:
  • Heartbeat monitoring
  • Repository disk access
  • Network and node failure detection
  • Cluster configuration synchronization
Key CAA daemons include:
  • clcomd – Communication handler
  • clconfd – Synchronizes configuration changes (~ every 10 minutes)
  • ctrmc – Monitors resources (part of RSCT)
  • clstrmgrES – PowerHA’s cluster manager daemon

The CAA Repository Disk
The repository disk is the central coordination point for PowerHA clusters.
Key Facts:
  • Dedicated use only — cannot store application data.
  • Typical size: 512 MB – 10 GB
  • Managed exclusively by CAA (not standard LVM).
  • Ensures consistency across all cluster nodes.
  • Recommended: RAID and multipathing for redundancy.
The repository disk enables heartbeat persistence even if the network fails — ensuring continuous cluster integrity.
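The repository disk's state can be confirmed from any node. A sketch (the CAA private volume group name caavg_private is standard; output details vary by release):

```shell
REPO_VG=caavg_private   # CAA's private volume group holding the repository disk
lscluster -d            # repository disk identity and state as CAA sees it
lspv | grep "$REPO_VG"  # the repository hdisk should appear in caavg_private
```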

Deadman Switch (DMS): Cluster Safety Mechanism
The Deadman Switch (DMS) protects cluster integrity by detecting hung or isolated nodes.
Modes:
  • Mode "a" (assert): Forces node crash to prevent split-brain.
  • Mode "e" (event): Triggers an AHAFS event for manual intervention.
By enforcing these safety protocols, PowerHA prevents data corruption during severe node/network failures.

RSCT – Reliable Scalable Cluster Technology
RSCT is the backbone of PowerHA, providing monitoring, event handling, and system coordination.
Components:
  • RMC (Resource Monitoring and Control): Tracks cluster resources.
  • HAGS (Group Services): Handles cluster messaging and coordination.
  • HATS (Topology Services): Monitors heartbeat and detects failures.
  • SRC (System Resource Controller): Manages daemon processes.
RSCT organizes nodes into:
  • Peer Domains (operational clusters)
  • Management Domains (administrative supervision)
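The RSCT layer can be inspected with the SRC and peer-domain utilities; a sketch (subsystem names are the standard RSCT ones, but group membership varies slightly by release):

```shell
RSCT_RM=ctrmc      # Resource Monitoring and Control daemon
lssrc -g rsct      # core RSCT subsystems under SRC control
lssrc -s "$RSCT_RM"
lsrpdomain         # peer domains defined on this node
lsrpnode           # nodes in the active peer domain
```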

PowerHA Cluster Services
PowerHA relies on tightly integrated services to ensure continuous operation:
  • clstrmgrES: Main cluster manager
  • clevmgrdES: Manages shared LVM coordination
  • clinfoES: Provides monitoring and status info
  • RSCT and CAA daemons: Enable communication, health checks, and configuration sync
Together, these maintain a robust high-availability ecosystem.

Cluster Verification: clverify
Before deployment or after any configuration change, PowerHA uses clverify to check cluster consistency.
It detects:
  • Network misconfigurations
  • Volume group mismatches
  • Missing resources
Logs are stored at:
# cat /var/hacmp/clverify/clverify.log
Verification can be run via CLI or SMIT, ensuring a healthy cluster before going live.
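From the CLI, verification can be driven through clmgr (the modern front end to clverify; exact flags may vary by PowerHA level — treat this as a sketch):

```shell
LOG=/var/hacmp/clverify/clverify.log        # default verification log location
clmgr verify cluster                        # runs clverify across all nodes
grep -iE "error|fail" "$LOG" | tail -20     # review the most recent findings
```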

C-SPOC (Cluster Single Point of Control)
C-SPOC simplifies administration by letting you manage the entire cluster from a single node.
Functions include:
  • Synchronizing configuration changes across all nodes
  • Managing volume groups and user accounts
  • Propagating commands securely via clcomd
This reduces complexity and ensures operational consistency.
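C-SPOC is usually reached through its SMIT fast path; the underlying helper commands live in a fixed directory. A sketch (directory contents vary by release):

```shell
CSPOC_DIR=/usr/es/sbin/cluster/cspoc   # C-SPOC helper commands
smitty cl_admin                        # SMIT fast path into the C-SPOC menus
ls "$CSPOC_DIR"                        # cluster-wide LVM/user/group helpers
```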

Application Server & Monitor
Application Server: Hosts the clustered application.
Application Monitor: Ensures service health via two methods:
  • Process Monitoring: Tracks app processes through RSCT.
  • Custom Monitoring: Uses scripts to validate service functionality.
If an application fails, PowerHA can restart it locally or fail it over to another node.
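A custom monitor is simply a script whose exit status PowerHA interprets (0 = healthy, non-zero = failed). A minimal sketch; the process name db_listener and the check_app helper are illustrative, not PowerHA APIs:

```shell
#!/bin/ksh
# check_app: succeed (0) if a process with the given name is running
check_app() {
    ps -e -o comm= | grep -w "$1" >/dev/null 2>&1
}

APP_PROCESS="${1:-db_listener}"   # hypothetical application process name
if check_app "$APP_PROCESS"; then
    echo "Monitor: $APP_PROCESS is running"
else
    echo "Monitor: $APP_PROCESS not found"
fi
# A real monitor would exit with the check's status so PowerHA can restart
# the application locally or trigger a fallover.
```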

DARE (Dynamic Automatic Reconfiguration) Snapshot
DARE snapshots capture complete cluster configurations live, allowing rollback or restoration.
  • Stored in /usr/es/sbin/cluster/snapshots
  • Used for troubleshooting, change rollback, or migration
  • Works without stopping the cluster
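Snapshots can be created live from the CLI; a sketch using clmgr (the snapshot name and description are examples, and flags may vary by PowerHA level):

```shell
SNAP_DIR=/usr/es/sbin/cluster/snapshots   # default snapshot location
# Capture the running configuration before a planned change
clmgr add snapshot pre_change_$(date +%Y%m%d) DESCRIPTION="Before TL update"
ls -l "$SNAP_DIR"                         # snapshot .odm/.info files land here
```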
Network Topology, Persistent IPs, and Service IPs
Persistent Node IPs:
Static IPs for administrative access; remain on the same node.
Service IPs:
Floating IPs tied to resource groups; move automatically during failover.

Redundant networks, multiple NICs, and heartbeat links provide fault tolerance and seamless failover.

Logs and Diagnostics
Log File                                        Description
/var/hacmp/log/clstrmgr.debug    Cluster manager debug logs
/var/hacmp/adm/cluster.log    General cluster events
/var/hacmp/clverify/clverify.log    Verification logs
/var/hacmp/log/cspoc.log    CSPOC operations
/var/hacmp/adm/history/    Daily cluster activity
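During failover testing, it helps to watch the main event log live; for example:

```shell
CLUSTER_LOG=/var/hacmp/adm/cluster.log   # general cluster events (see table above)
tail -f "$CLUSTER_LOG"                   # follow events as they occur
```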

AIX mksysb Backup


Here’s an MKSYSB backup script that:
  • Creates the MKSYSB backup
  • Copies it to the NIM server
  • Sends an email on success or failure
Full Script with Email Notification
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
#!/bin/ksh
# Variables
DATE=$(date +%Y%m%d_%H%M)
HOSTNAME=$(hostname)
MKSYSB_FILE="/backup/mksysb_${HOSTNAME}_${DATE}.mksysb"
NIM_SERVER="master" # Your NIM server hostname or IP
NIM_BACKUP_DIR="/export/mksysb_backups"
REMOTE_USER="nimadmin" # Remote server user name
EMAIL_TO="admin@example.com" # Change to your email address
EMAIL_FROM="noreply@example.com"
SUBJECT_SUCCESS="MKSYSB Backup Completed Successfully on ${HOSTNAME}"
SUBJECT_FAIL="MKSYSB Backup FAILED on ${HOSTNAME}"

# Function to send email (note: "local" is not a ksh builtin, so plain
# assignments are used here)
send_email() {
    subject=$1
    message=$2
    (
        echo "From: $EMAIL_FROM"
        echo "To: $EMAIL_TO"
        echo "Subject: $subject"
        echo ""
        echo "$message"
    ) | /usr/sbin/sendmail -t
}

# Start backup
echo "Starting MKSYSB backup at $(date)..."
/usr/bin/mksysb -i -X "$MKSYSB_FILE"
if [ $? -ne 0 ]; then
    send_email "$SUBJECT_FAIL" "MKSYSB backup failed on ${HOSTNAME} at $(date)."
    echo "MKSYSB backup failed!"
    exit 1
fi
echo "MKSYSB backup created: $MKSYSB_FILE"

# Copy to NIM server
echo "Copying MKSYSB backup to NIM server $NIM_SERVER..."
scp "$MKSYSB_FILE" "${REMOTE_USER}@${NIM_SERVER}:${NIM_BACKUP_DIR}/"
if [ $? -ne 0 ]; then
    send_email "$SUBJECT_FAIL" "Failed to copy MKSYSB backup to NIM server (${NIM_SERVER}) from ${HOSTNAME} at $(date)."
    echo "Failed to copy MKSYSB to NIM server!"
    exit 2
fi

# Cleanup local backups, keep last 5
# (GNU-style "head -n -5" and "xargs -r" are not available on AIX, so list
# newest first and delete everything after the fifth entry instead.)
echo "Cleaning up old local backups..."
ls -1t /backup/mksysb_${HOSTNAME}_*.mksysb 2>/dev/null | tail -n +6 | while read f; do
    rm -f "$f"
done

# Send success email
send_email "$SUBJECT_SUCCESS" "MKSYSB backup completed and copied to NIM server successfully on ${HOSTNAME} at $(date)."
echo "Backup process completed successfully."
exit 0

----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

Scheduling
Add to cron to automate:
0 2 * * * /path/to/mksysb_backup.sh >> /var/log/mksysb_backup.log 2>&1

This runs /path/to/mksysb_backup.sh at 2:00 AM every day and appends both output and errors to /var/log/mksysb_backup.log.

AIX Backup

AIX admins know that backups aren’t optional—they’re your lifeline for downtime recovery, server migrations, and disaster recovery (DR). This guide dives deep into mksysb (rootvg images), savevg (VG snapshots), and tar (file archives), including commands, options, restores, prerequisites, and pro tips. Let’s level up your AIX backup game.

MKSYSB Backup:
MKSYSB is a bootable backup of the root volume group (rootvg) on AIX.
It contains OS files, configuration, and can be used to restore or clone the system.

MKSYSB Backup Command
Example:
# mksysb -i -X /backup/mksysb_$(hostname)_$(date +%Y%m%d).mksysb

This creates a backup file with the hostname and date in the filename, e.g., /backup/mksysb_appserver1_20251015.mksysb.

Useful Options:
-i: Calls mkszfile to generate a fresh /image.data file before the backup (recommended).
-e: Excludes the files and directories listed in /etc/exclude.rootvg.
-X: Automatically expands /tmp if more space is needed during the backup.
-m: Creates map files so logical volumes are restored to the same physical partitions.
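A backup is only useful if it is readable, so it is worth verifying the image afterwards with lsmksysb; a sketch (the backup path is an example):

```shell
BACKUP=/backup/mksysb_$(hostname)_$(date +%Y%m%d).mksysb   # example image path
lsmksysb -lf "$BACKUP"          # volume group information stored in the image
lsmksysb -f "$BACKUP" | head    # first few files contained in the image
```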

MKSYSB Backup Script:
1. The jump server and AIX servers should have passwordless SSH authentication configured.
2. The script mounts the NFS share on each AIX server.
3. It takes an mksysb (rootvg) backup to the NFS share.
4. It unmounts the NFS share.
5. Run the script: ./remote_backup_aix_mksysb.sh <server1> <server2> <server3> ...

Example Script: remote_backup_aix_mksysb.sh
--------------------------------------------------------------------------------------------------------------
#!/bin/bash
# ===== CONFIG =====
REMOTE_USER="root"
NFS_SERVER="192.168.10.11"
NFS_PATH="/aix/backup"
MOUNT_POINT="/mnt"
# ===== CHECK INPUT =====
if [ $# -lt 1 ]; then
    echo "Usage: $0 <server1> <server2> <server3> ..."
    exit 1
fi
# ===== LOOP THROUGH ALL SERVERS =====
for REMOTE_HOST in "$@"
do
    echo "==============================================="
    echo "Starting backup on: ${REMOTE_HOST}"
    echo "==============================================="
    ssh -o BatchMode=yes ${REMOTE_USER}@${REMOTE_HOST} << EOF
echo "Connected to \$(hostname)"
nfso -o nfs_use_reserved_ports=1
# Check if already mounted
if mount | grep " ${MOUNT_POINT} " > /dev/null 2>&1
then
    echo "${MOUNT_POINT} already mounted.............."
else
    echo "Mounting NFS share.........................."
    mount ${NFS_SERVER}:${NFS_PATH} ${MOUNT_POINT}
    if [ \$? -ne 0 ]; then
        echo "ERROR: Mount failed....................."
        exit 1
    fi
fi
HOSTNAME=\$(hostname)
DATE=\$(date +%Y%m%d)
BACKUP_DIR=${MOUNT_POINT}/backup
BACKUP_FILE=\${BACKUP_DIR}/mksysb_\${HOSTNAME}_\${DATE}.mksysb
mkdir -p \${BACKUP_DIR}
echo "Starting mksysb backup..."
mksysb -i -X \${BACKUP_FILE}
if [ \$? -ne 0 ]; then
    echo "ERROR: mksysb failed."
    exit 1
fi
echo "Backup completed successfully."
echo "Unmounting ${MOUNT_POINT}..."
umount ${MOUNT_POINT}
if [ \$? -ne 0 ]; then
    echo "WARNING: Unmount failed....."
fi
echo "Finished on \$(hostname)"
exit 0
EOF
    if [ $? -eq 0 ]; then
        echo "SUCCESS: ${REMOTE_HOST} backup completed."
    else
        echo "FAILED: ${REMOTE_HOST} backup failed."
    fi

    echo ""
done
echo "All servers processed...................."
--------------------------------------------------------------------------------------------------------------

SAVEVG Backup:
savevg is an AIX command that creates a backup of a volume group (VG), including all logical volumes and data in that VG. Common uses include:
  • Backing up entire volume groups before making changes.
  • Migrating volume groups.
  • Disaster recovery.

Basic syntax:
# savevg -f <backup_file_path> <vgname>
-f <backup_file_path>: Specifies the path and filename where the VG backup will be saved.
<vgname>: Name of the volume group you want to back up.
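Putting the syntax together with its restore counterpart, restvg; a sketch (the VG name nfs_vg and disk hdisk3 are examples, and restvg overwrites the target disk):

```shell
VG=nfs_vg   # example data volume group
# -i regenerates the <vgname>.data file before the backup
savevg -i -f /backup/${VG}_$(date +%Y%m%d).savevg "$VG"

# Restore onto a target disk later (destructive on that disk):
# restvg -f /backup/nfs_vg_20251015.savevg hdisk3
```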

Tar Backup:
tar (tape archive) bundles multiple files/directories into a single archive file.
Often used with compression (gzip or bzip2) to save space.

Basic tar backup command
To create a backup archive of a directory, for example /home:
# tar -cvf /backup/home_backup_$(date +%Y%m%d).tar /home
-c = create a new archive
-v = verbose (lists files as they're archived)
-f = specifies the filename of the archive

Compressing the tar archive with gzip
# tar -czvf /backup/home_backup_$(date +%Y%m%d).tar.gz /home
-z = compress the archive using gzip
Compressing the tar archive with bzip2 (better compression)
# tar -cjvf /backup/home_backup_$(date +%Y%m%d).tar.bz2 /home
-j = compress using bzip2
Note: the -z and -j flags require GNU tar; native AIX tar does not support them, so pipe through gzip or bzip2 instead (e.g., tar -cvf - /home | gzip > backup.tar.gz).

Extracting from a tar archive

Without compression:
# tar -xvf archive.tar
With gzip compression:
# tar -xzvf archive.tar.gz
With bzip2 compression:
# tar -xjvf archive.tar.bz2

Example: Backup /etc directory to compressed archive
# tar -czvf /backup/etc_backup_$(date +%Y%m%d).tar.gz /etc
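Before relying on any archive, list its contents with -t to confirm it is readable. A self-contained sketch using a throwaway directory:

```shell
# Create a small test archive, then verify it can be listed back
mkdir -p /tmp/tar_demo && echo "hello" > /tmp/tar_demo/file.txt
tar -czf /tmp/demo.tar.gz -C /tmp tar_demo
tar -tzf /tmp/demo.tar.gz   # lists tar_demo/ and tar_demo/file.txt
```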

IBM NIM

Network Installation Manager (NIM) is an IBM tool designed to automate the installation, configuration, and maintenance of AIX operating systems across multiple machines over a network.
  • Centralized management of all AIX installations
  • Standardized deployment of operating systems, updates, and patches
  • Automated system backups and recovery
  • Support for both diskless and disk-based clients
NIM Master:
The NIM Master is the central hub of the entire setup. It stores and manages all resources needed for client installations or maintenance. Key resources include:
  • LPP_SOURCE – The AIX installation files and updates.
  • SPOT – A bootable temporary environment for network installations.
  • MKSYSB – Full system backup images of clients.
  • CONFIG / SCRIPT – Automation scripts and configuration files to standardize client setups.
The Master communicates with clients primarily over TCP port 475, which is reserved for NIM protocol operations. This ensures commands, coordination, and status updates flow reliably between Master and Client.

Network Services:
Network services facilitate client booting and resource access:
  • BOOTP/DHCP – Assigns IP addresses and provides boot parameters to clients. Diskless and disk-based clients both request their network configuration from the Master at startup.
  • TFTP (Trivial File Transfer Protocol) – Transfers the SPOT boot image from the NIM Master to clients. This happens during the network boot phase.
  • NFS (Network File System) – Allows clients to mount NIM resources like LPP_SOURCE, MKSYSB, CONFIG, and SCRIPT without storing them locally.
NIM Clients:
There are two types of clients in a NIM environment:
  • Diskless Clients – Boot entirely over the network without using a local disk. They rely on SPOT and NFS resources to run the OS.
  • Disk-based Clients – Standard AIX systems that use local disks but still boot over the network for installation or updates.
The client workflow follows this pattern:
  • Power-On / Network Boot – Client sends a BOOTP/DHCP request.
  • Boot Image Transfer – TFTP downloads SPOT from the NIM Master.
  • Resource Mounting – NFS mounts resources needed for installation or updates.
  • Installation / Restore / Update – Master coordinates the process over TCP 475.
  • Reboot – Once completed, the client boots from its local disk (if disk-based) with a fully configured AIX system.
Data Flow:
  • BOOTP/DHCP → IP and boot info assigned
  • TFTP → SPOT boot image sent to client
  • NFS → Resources mounted and accessed for installation
  • TCP 475 → NIM commands, status updates, and session management

The NIM installation process follows these steps:
  • Client Power-On – Client broadcasts BOOTP/DHCP request
  • IP & Boot Info Assignment – NIM Master responds with IP configuration and SPOT location
  • Boot Image Transfer – Client downloads SPOT image via TFTP
  • Resource Mounting – Client mounts LPP_SOURCE, MKSYSB, CONFIG, SCRIPT via NFS
  • Installation / Maintenance – NIM Master coordinates installation or restore over TCP port 475
  • Client Reboot – Client boots from local disk as a fully configured AIX system
Implementing NIM provides several advantages:
  • Centralized Management: Manage all AIX systems from a single NIM Master
  • Automation: Eliminate manual installations and configurations
  • Scalability: Deploy OS across dozens or hundreds of systems simultaneously
  • System Backup & Restore: Use MKSYSB images for fast disaster recovery
  • Consistency: Standardized resources minimize configuration drift
  • Faster Deployment: Network-based booting speeds up installations
  • Reduced Human Error: Scripted deployments reduce mistakes
  • Flexible Updates: Apply patches and updates centrally
  • Support for Diskless Clients: Ideal for test or thin-client environments
  • Cost Efficiency: Reduce manual labor and installation media costs
Setting Up NIM Master

Prerequisites
  • Hardware: IBM Power System (Power7/8/9/10), 4–8 GB RAM, 20 GB disk
  • OS: Supported AIX version (7.1, 7.2, 7.3)
  • NIM Packages: bos.sysmgt.nim.master, bos.sysmgt.nim.spot, bos.sysmgt.nim.client
  • Network: Static IP, NFS configured, BOOTP/DHCP ready
  • Time Synchronization: NTP or chronyd recommended
Installation Steps
Install AIX on the system designated as NIM Master

Install required NIM packages:
# installp -agXd /mnt/lpp_7200-04-02/installp/ppc bos.sysmgt.nim.master bos.sysmgt.nim.spot bos.sysmgt.nim.client

Initialize NIM Master:
# smitty nim --> Configure the NIM Environment --> Advanced Configuration -->  Initialize the NIM Master Only

Configure NFS exports:
# startsrc -g nfs
# lssrc -g nfs

Subsystem         Group            PID          Status
biod             nfs              5112226      active
rpc.lockd        nfs              4325836      active
nfsd             nfs              6750656      active
rpc.mountd       nfs              6554006      active
rpc.statd        nfs              7209330      active
nfsrgyd          nfs                           inoperative
gssd             nfs                           inoperative

# vi /etc/exports
/export/lpp_source -rw,anon=0
/export/spot -rw,anon=0
# exportfs -ua
# exportfs -va
Configure BOOTP/DHCP as needed
Verify TCP Port 475 is open and NIM daemons are running
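The port and daemon checks in the last step can be scripted; a sketch (nimesis is the NIM master daemon under SRC):

```shell
PORT=475                        # reserved NIM protocol port
lssrc -s nimesis                # NIM master daemon should be active
netstat -an | grep -w "$PORT"   # confirm the port is listening
grep -w nim /etc/services       # nim 475/tcp should be registered
```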

Defining NIM Resources
NIM relies on resources to manage clients:
  • LPP_SOURCE: Installation files and patches
  • SPOT: Bootable client environment
  • MKSYSB: System backup image
  • CONFIG/SCRIPT: Configuration and automation scripts
Example: Define LPP_SOURCE from DVD 1
# nim -o define -t lpp_source -a server=master -a source=/mnt/aix_7200-04-02-2027_1of2_072020.iso -a location=/export/lpp_source/lpp_7200-04-02 -a packages=all lpp_7200-04-02

Add Images from DVD 2
# nim -o update -a source=/mnt/aix_7200-04-02-2027_2of2_072020.iso -a packages=all lpp_7200-04-02

Create a SPOT from LPP_SOURCE:
# nim -o define -t spot -a server=master -a source=lpp_7200-04-02 -a location=/export/spot spot_7200-04-02

Client Operations

Install a client using SPOT & LPP_SOURCE:
# nim -o bos_inst -a spot=spot_7200-04-02 -a lpp_source=lpp_7200-04-02 <client>

Install from MKSYSB backup:
# nim -o bos_inst -a source=mksysb -a spot=spot_7200-04-02 -a mksysb=mksysb_backup <client>

Reset a client:
# nim -F -o reset <client>

NIM also supports advanced tasks like alternate disk migration and spot customization for patching or updates.
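Client and resource state can be inspected with lsnim between operations; a sketch (the client name aixclient1 is hypothetical):

```shell
CLIENT=aixclient1      # hypothetical NIM client machine name
lsnim -l "$CLIENT"     # full attribute listing, including Cstate
lsnim -t standalone    # list all standalone client machines
lsnim -t lpp_source    # list defined LPP_SOURCE resources
```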

Updating LPP Source and SPOT

To update an existing LPP source with a new TL/SP:
Example:
# nim -o update -a packages=all -a source=/aix/AIX_v7.3_Install_7300-03-00-2446_LCD8299301.iso lpp_7300-03-00

Customize SPOT for alternative disk installation:
# nim -o cust -a lpp_source=lpp_7300-03-00 -a filesets=bos.alt_disk_install.rte spot_7300-03-00

# nim -o cust -a filesets=bos.alt_disk_install.boot_images -a lpp_source=lpp_7300-03-00 spot_7300-03-00

PowerHA Cluster Manually Startup

The script below starts a PowerHA cluster's resources manually, including steps to:

  1. Check cluster IP

  2. Check network interface and assign alias if needed

  3. Vary on cluster volume groups (VGs)

  4. Mount the filesystems

Here is the script:
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
#!/bin/ksh

echo "Check cluster IP Address......................................."
cltopinfo

echo "Check Network Interfaces......................................."
ifconfig -a

# ksh's read has no bash-style -p prompt option, so print the prompt first
echo "Please Enter the NIC card, Virtual IP Address & Subnet Mask (space separated): "
read niccard virtualip subnetmsk

echo "NIC Card: $niccard"
echo "Virtual IP: $virtualip"
echo "Subnet Mask: $subnetmsk"

echo "Adding Alias IP........................................................"

ifconfig $niccard alias $virtualip netmask $subnetmsk up
if [ $? -ne 0 ]; then
    echo "Failed to add alias IP on $niccard"
    exit 1
fi

echo "Varyon cluster volume groups and Mount Filesystems.............."

# Extract volume groups related to cluster resources

vgs=$(clshowres | grep "Volume Group" | grep "vg" | awk '{print $NF}')
if [ -z "$vgs" ]; then
    echo "No volume groups found in cluster resources"
    exit 1
fi

for vg in $vgs
do
    echo "Varyon volume group: $vg"
    varyonvg -O $vg
    if [ $? -ne 0 ]; then
        echo "Failed to varyon volume group $vg"
        exit 1
    fi

    # List jfs2 filesystems in this VG (exclude jfs2log)
    FS_LIST=$(lsvg -l $vg | awk '/jfs2/ && !/jfs2log/ {print $7}')
    if [ -z "$FS_LIST" ]; then
        echo "No JFS2 filesystems found in volume group $vg"
        continue
    fi

    for fs in $FS_LIST
    do
        echo "Mounting filesystem: $fs"
        mount $fs
        if [ $? -ne 0 ]; then
            echo "Failed to mount filesystem $fs"
            exit 1
        fi
    done
done

echo "Mount any remaining filesystems from /etc/filesystems......."
mount -a
if [ $? -ne 0 ]; then
    echo "Failed to mount filesystems"
    exit 1
fi

echo "PowerHA Cluster Manual Start Script Completed..............."


------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
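After the manual start completes, a few quick checks confirm the node is actually serving the resources; a sketch (nfs_vg is the example VG used earlier in this document):

```shell
EXPECTED_VG=nfs_vg                  # example cluster VG from earlier sections
lsvg -o | grep -w "$EXPECTED_VG"    # it should now appear among varied-on VGs
df -g                               # mounted filesystems (sizes in GB)
netstat -in                         # interfaces, including the alias service IP
```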