AIX PowerHA(HACMP) Cluster Overview

High availability (HA) is essential in enterprise AIX environments, where downtime translates to significant business losses. IBM PowerHA SystemMirror, formerly known as HACMP (High Availability Cluster Multi-Processing), is IBM’s flagship clustering solution for AIX running on Power Systems.

It ensures near-continuous application uptime by automating failure detection, failover, and recovery — making it a cornerstone of mission-critical AIX deployments.

PowerHA SystemMirror:
PowerHA SystemMirror provides a robust framework for high availability and disaster recovery in AIX environments. It integrates deeply with Cluster Aware AIX (CAA) and Reliable Scalable Cluster Technology (RSCT), creating an intelligent cluster that can detect failures, reassign resources, and recover services automatically.

With PowerHA, applications continue running seamlessly, even during hardware, network, or node failures.

Core Architecture and Components:
At its foundation, PowerHA clusters consist of nodes, networks, shared storage, and resources coordinated by a suite of daemons and management utilities.

Cluster Nodes:
  • Each node runs AIX and participates in the cluster.
  • Supports up to 32 nodes for large-scale environments.
Cluster Networks:
  • Internal (heartbeat) network for node-to-node communication.
  • Service network for client access using floating service IPs.
Example:
Node1_Boot: 192.168.10.101
Node2_Boot: 192.168.10.102
Service IP: 192.168.10.201

Shared Storage:
  • Repository Disk: Stores cluster configuration and locking data.
  • Application/NFS Disks: Shared via Enhanced Concurrent Volume Groups (ECVGs).
Example:
Repository Disk: hdisk2
Application VG:  nfs_vg
Logical Volume:  nfs_lv

PowerHA Cluster Daemons and Services
Daemon                            Function
clstrmgrES   Cluster manager; maintains heartbeat, manages events, and drives failover logic.
clcomdES   Handles node-to-node communication.
cllockd     Provides distributed resource locking.
gsclvmd    Manages Enhanced Concurrent Volume Groups (ECVGs).
clsmuxpd   Delivers cluster status monitoring services.

These daemons ensure that the cluster maintains consistency, communicates efficiently, and executes failovers seamlessly.

PowerHA Failover Process (Logical Flow)
  • Node1 hosts the active resources — applications, service IPs, and shared disks.
  • A heartbeat failure or node crash is detected by clstrmgrES via CAA.
  • Cluster daemons automatically relocate resources to Node2.
  • The Service IP and associated applications start on Node2.
  • Clients continue to connect without service interruption.
This automated recovery process ensures minimal downtime and data consistency.
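During or after a fallover, the cluster and resource-group state can be checked from the command line. A minimal sketch, assuming a resource group named app_rg (the name is illustrative; clRGinfo, lssrc, and lscluster are the standard PowerHA/CAA status tools):

```shell
RG=app_rg   # hypothetical resource group name
/usr/es/sbin/cluster/utilities/clRGinfo "$RG"   # shows which node hosts this RG
lssrc -ls clstrmgrES | grep -i state            # cluster manager state (e.g., ST_STABLE)
lscluster -m                                    # CAA's view of node and interface health
```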

PowerHA Startup & Failover Policies
Start cluster services (resource groups come online according to their startup policy):
# clmgr online cluster
Set the resource group policies — start on home node only, fall over to the next priority node, never fall back:
# clmgr modify resource_group app_rg STARTUP=OHN FALLOVER=FNPN FALLBACK=NFB
Stop cluster services gracefully (resource groups are brought offline):
# clmgr offline cluster MANAGE=offline

Key Features of PowerHA 7.2
  • Supports up to 32-node clusters.
  • Full integration with CAA and RSCT frameworks.
  • Enhanced Concurrent Volume Groups (ECVGs) for shared disk access.
  • Flexible Startup, Failover, and Fallback policies.
  • Dynamic Automatic Reconfiguration (DARE) snapshots for live configuration capture.
  • Simplified management via C-SPOC (Cluster Single Point of Control).
Required Filesets
Ensure the following filesets are installed on all nodes:
  • cluster.es.client – Client components
  • cluster.es.server – Server components
  • cluster.es.cspoc – Cluster Single Point of Control
  • bos.clvm – Required for enhanced concurrent volume groups
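The fileset list above can be checked on each node with lslpp; a quick sketch:

```shell
# Verify the required PowerHA filesets are installed on this node
FILESETS="cluster.es.client cluster.es.server cluster.es.cspoc bos.clvm"
for f in $FILESETS; do
    lslpp -l "$f" >/dev/null 2>&1 || echo "MISSING: $f"
done
```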
Cluster Awareness and Communication: CAA Integration
Cluster Aware AIX (CAA) is the kernel-level clustering infrastructure beneath PowerHA.
It handles:
  • Heartbeat monitoring
  • Repository disk access
  • Network and node failure detection
  • Cluster configuration synchronization
Key CAA daemons include:
  • clcomd – Communication handler
  • clconfd – Synchronizes configuration changes (~ every 10 minutes)
  • ctrmc – Monitors resources (part of RSCT)
  • clstrmgrES – PowerHA’s cluster manager daemon

The CAA Repository Disk
The repository disk is the central coordination point for PowerHA clusters.
Key Facts:
  • Dedicated use only — cannot store application data.
  • Typical size: 512 MB – 10 GB
  • Managed exclusively by CAA (not standard LVM).
  • Ensures consistency across all cluster nodes.
  • Recommended: RAID and multipathing for redundancy.
The repository disk enables heartbeat persistence even if the network fails — ensuring continuous cluster integrity.
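The repository disk's state can be confirmed from any node. A sketch (the CAA private volume group name caavg_private is standard; output details vary by release):

```shell
REPO_VG=caavg_private   # CAA's private volume group holding the repository disk
lscluster -d            # repository disk identity and state as CAA sees it
lspv | grep "$REPO_VG"  # the repository hdisk should appear in caavg_private
```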

Deadman Switch (DMS): Cluster Safety Mechanism
The Deadman Switch (DMS) protects cluster integrity by detecting hung or isolated nodes.
Modes:
  • Mode "a" (assert): Forces node crash to prevent split-brain.
  • Mode "e" (event): Triggers an AHAFS event for manual intervention.
By enforcing these safety protocols, PowerHA prevents data corruption during severe node/network failures.

RSCT – Reliable Scalable Cluster Technology
RSCT is the backbone of PowerHA, providing monitoring, event handling, and system coordination.
Components:
  • RMC (Resource Monitoring and Control): Tracks cluster resources.
  • HAGS (Group Services): Handles cluster messaging and coordination.
  • HATS (Topology Services): Monitors heartbeat and detects failures.
  • SRC (System Resource Controller): Manages daemon processes.
RSCT organizes nodes into:
  • Peer Domains (operational clusters)
  • Management Domains (administrative supervision)
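The RSCT layer can be inspected with the SRC and peer-domain utilities; a sketch (subsystem names are the standard RSCT ones, but group membership varies slightly by release):

```shell
RSCT_RM=ctrmc      # Resource Monitoring and Control daemon
lssrc -g rsct      # core RSCT subsystems under SRC control
lssrc -s "$RSCT_RM"
lsrpdomain         # peer domains defined on this node
lsrpnode           # nodes in the active peer domain
```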

PowerHA Cluster Services
PowerHA relies on tightly integrated services to ensure continuous operation:
  • clstrmgrES: Main cluster manager
  • clevmgrdES: Manages shared LVM coordination
  • clinfoES: Provides monitoring and status info
  • RSCT and CAA daemons: Enable communication, health checks, and configuration sync
Together, these maintain a robust high-availability ecosystem.

Cluster Verification: clverify
Before deployment or after any configuration change, PowerHA uses clverify to check cluster consistency.
It detects:
  • Network misconfigurations
  • Volume group mismatches
  • Missing resources
Logs are stored at:
# cat /var/hacmp/clverify/clverify.log
Verification can be run via CLI or SMIT, ensuring a healthy cluster before going live.
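From the CLI, verification can be driven through clmgr (the modern front end to clverify; exact flags may vary by PowerHA level — treat this as a sketch):

```shell
LOG=/var/hacmp/clverify/clverify.log        # default verification log location
clmgr verify cluster                        # runs clverify across all nodes
grep -iE "error|fail" "$LOG" | tail -20     # review the most recent findings
```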

C-SPOC (Cluster Single Point of Control)
C-SPOC simplifies administration by letting you manage the entire cluster from a single node.
Functions include:
  • Synchronizing configuration changes across all nodes
  • Managing volume groups and user accounts
  • Propagating commands securely via clcomd
This reduces complexity and ensures operational consistency.
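C-SPOC is usually reached through its SMIT fast path; the underlying helper commands live in a fixed directory. A sketch (directory contents vary by release):

```shell
CSPOC_DIR=/usr/es/sbin/cluster/cspoc   # C-SPOC helper commands
smitty cl_admin                        # SMIT fast path into the C-SPOC menus
ls "$CSPOC_DIR"                        # cluster-wide LVM/user/group helpers
```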

Application Server & Monitor
Application Server: Hosts the clustered application.
Application Monitor: Ensures service health via two methods:
  • Process Monitoring: Tracks app processes through RSCT.
  • Custom Monitoring: Uses scripts to validate service functionality.
If an application fails, PowerHA can restart it locally or fail it over to another node.
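A custom monitor is simply a script whose exit status PowerHA interprets (0 = healthy, non-zero = failed). A minimal sketch; the process name db_listener and the check_app helper are illustrative, not PowerHA APIs:

```shell
#!/bin/ksh
# check_app: succeed (0) if a process with the given name is running
check_app() {
    ps -e -o comm= | grep -w "$1" >/dev/null 2>&1
}

APP_PROCESS="${1:-db_listener}"   # hypothetical application process name
if check_app "$APP_PROCESS"; then
    echo "Monitor: $APP_PROCESS is running"
else
    echo "Monitor: $APP_PROCESS not found"
fi
# A real monitor would exit with the check's status so PowerHA can restart
# the application locally or trigger a fallover.
```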

DARE (Dynamic Automatic Reconfiguration) Snapshot
DARE snapshots capture complete cluster configurations live, allowing rollback or restoration.
  • Stored in /usr/es/sbin/cluster/snapshots
  • Used for troubleshooting, change rollback, or migration
  • Works without stopping the cluster
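Snapshots can be created live from the CLI; a sketch using clmgr (the snapshot name and description are examples, and flags may vary by PowerHA level):

```shell
SNAP_DIR=/usr/es/sbin/cluster/snapshots   # default snapshot location
# Capture the running configuration before a planned change
clmgr add snapshot pre_change_$(date +%Y%m%d) DESCRIPTION="Before TL update"
ls -l "$SNAP_DIR"                         # snapshot .odm/.info files land here
```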
Network Topology, Persistent IPs, and Service IPs
Persistent Node IPs:
Static IPs for administrative access; remain on the same node.
Service IPs:
Floating IPs tied to resource groups; move automatically during failover.

Redundant networks, multiple NICs, and heartbeat links provide fault tolerance and seamless failover.

Logs and Diagnostics
Log File                                        Description
/var/hacmp/log/clstrmgr.debug    Cluster manager debug logs
/var/hacmp/adm/cluster.log    General cluster events
/var/hacmp/clverify/clverify.log    Verification logs
/var/hacmp/log/cspoc.log    CSPOC operations
/var/hacmp/adm/history/    Daily cluster activity
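During failover testing, it helps to watch the main event log live; for example:

```shell
CLUSTER_LOG=/var/hacmp/adm/cluster.log   # general cluster events (see table above)
tail -f "$CLUSTER_LOG"                   # follow events as they occur
```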

AIX mksysb Backup


Here’s an MKSYSB backup script that:
  • Creates the MKSYSB backup
  • Copies it to the NIM server
  • Sends an email on success or failure
Full Script with Email Notification
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
#!/bin/ksh
# Variables
DATE=$(date +%Y%m%d_%H%M)
HOSTNAME=$(hostname)
MKSYSB_FILE="/backup/mksysb_${HOSTNAME}_${DATE}.mksysb"
NIM_SERVER="master" # Your NIM server hostname or IP
NIM_BACKUP_DIR="/export/mksysb_backups"
REMOTE_USER="nimadmin" # Remote server user name
EMAIL_TO="admin@example.com" # Change to your email address
EMAIL_FROM="noreply@example.com"
SUBJECT_SUCCESS="MKSYSB Backup Completed Successfully on ${HOSTNAME}"
SUBJECT_FAIL="MKSYSB Backup FAILED on ${HOSTNAME}"

# Function to send email (note: "local" is not a ksh builtin, so plain
# assignments are used here)
send_email() {
    subject=$1
    message=$2
    (
        echo "From: $EMAIL_FROM"
        echo "To: $EMAIL_TO"
        echo "Subject: $subject"
        echo ""
        echo "$message"
    ) | /usr/sbin/sendmail -t
}

# Start backup
echo "Starting MKSYSB backup at $(date)..."
/usr/bin/mksysb -i -X "$MKSYSB_FILE"
if [ $? -ne 0 ]; then
    send_email "$SUBJECT_FAIL" "MKSYSB backup failed on ${HOSTNAME} at $(date)."
    echo "MKSYSB backup failed!"
    exit 1
fi
echo "MKSYSB backup created: $MKSYSB_FILE"

# Copy to NIM server
echo "Copying MKSYSB backup to NIM server $NIM_SERVER..."
scp "$MKSYSB_FILE" "${REMOTE_USER}@${NIM_SERVER}:${NIM_BACKUP_DIR}/"
if [ $? -ne 0 ]; then
    send_email "$SUBJECT_FAIL" "Failed to copy MKSYSB backup to NIM server (${NIM_SERVER}) from ${HOSTNAME} at $(date)."
    echo "Failed to copy MKSYSB to NIM server!"
    exit 2
fi

# Cleanup local backups, keep last 5
# (GNU-style "head -n -5" and "xargs -r" are not available on AIX, so list
# newest first and delete everything after the fifth entry instead.)
echo "Cleaning up old local backups..."
ls -1t /backup/mksysb_${HOSTNAME}_*.mksysb 2>/dev/null | tail -n +6 | while read f; do
    rm -f "$f"
done

# Send success email
send_email "$SUBJECT_SUCCESS" "MKSYSB backup completed and copied to NIM server successfully on ${HOSTNAME} at $(date)."
echo "Backup process completed successfully."
exit 0

----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

Scheduling
Add to cron to automate:
0 2 * * * /path/to/mksysb_backup.sh >> /var/log/mksysb_backup.log 2>&1

This runs /path/to/mksysb_backup.sh at 2:00 AM every day and appends both output and errors to /var/log/mksysb_backup.log.

AIX Backup

AIX admins know that backups aren’t optional—they’re your lifeline for downtime recovery, server migrations, and disaster recovery (DR). This guide dives deep into mksysb (rootvg images), savevg (VG snapshots), and tar (file archives), including commands, options, restores, prerequisites, and pro tips. Let’s level up your AIX backup game.

MKSYSB Backup:
MKSYSB is a bootable backup of the root volume group (rootvg) on AIX.
It contains OS files, configuration, and can be used to restore or clone the system.

MKSYSB Backup Command
Example:
# mksysb -i -X /backup/mksysb_$(hostname)_$(date +%Y%m%d).mksysb

This creates a backup file with the hostname and date in the filename, e.g., /backup/mksysb_appserver1_20251015.mksysb.

Useful Options:
-i: Calls mkszfile to generate a fresh /image.data file before the backup (recommended).
-e: Excludes the files and directories listed in /etc/exclude.rootvg.
-X: Automatically expands /tmp if more space is needed during the backup.
-m: Creates map files so logical volumes are restored to the same physical partitions.
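A backup is only useful if it is readable, so it is worth verifying the image afterwards with lsmksysb; a sketch (the backup path is an example):

```shell
BACKUP=/backup/mksysb_$(hostname)_$(date +%Y%m%d).mksysb   # example image path
lsmksysb -lf "$BACKUP"          # volume group information stored in the image
lsmksysb -f "$BACKUP" | head    # first few files contained in the image
```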

MKSYSB Backup Script:
1. The jump server and AIX servers should have passwordless SSH authentication configured.
2. The script mounts the NFS share on each AIX server.
3. It takes an mksysb (rootvg) backup to the NFS share.
4. It unmounts the NFS share.
5. Run the script: ./remote_backup_aix_mksysb.sh <server1> <server2> <server3> ...

Example Script: remote_backup_aix_mksysb.sh
--------------------------------------------------------------------------------------------------------------
#!/bin/bash
# ===== CONFIG =====
REMOTE_USER="root"
NFS_SERVER="192.168.10.11"
NFS_PATH="/aix/backup"
MOUNT_POINT="/mnt"
# ===== CHECK INPUT =====
if [ $# -lt 1 ]; then
    echo "Usage: $0 <server1> <server2> <server3> ..."
    exit 1
fi
# ===== LOOP THROUGH ALL SERVERS =====
for REMOTE_HOST in "$@"
do
    echo "==============================================="
    echo "Starting backup on: ${REMOTE_HOST}"
    echo "==============================================="
    ssh -o BatchMode=yes ${REMOTE_USER}@${REMOTE_HOST} << EOF
echo "Connected to \$(hostname)"
nfso -o nfs_use_reserved_ports=1
# Check if already mounted
if mount | grep " ${MOUNT_POINT} " > /dev/null 2>&1
then
    echo "${MOUNT_POINT} already mounted.............."
else
    echo "Mounting NFS share.........................."
    mount ${NFS_SERVER}:${NFS_PATH} ${MOUNT_POINT}
    if [ \$? -ne 0 ]; then
        echo "ERROR: Mount failed....................."
        exit 1
    fi
fi
HOSTNAME=\$(hostname)
DATE=\$(date +%Y%m%d)
BACKUP_DIR=${MOUNT_POINT}/backup
BACKUP_FILE=\${BACKUP_DIR}/mksysb_\${HOSTNAME}_\${DATE}.mksysb
mkdir -p \${BACKUP_DIR}
echo "Starting mksysb backup..."
mksysb -i -X \${BACKUP_FILE}
if [ \$? -ne 0 ]; then
    echo "ERROR: mksysb failed."
    exit 1
fi
echo "Backup completed successfully."
echo "Unmounting ${MOUNT_POINT}..."
umount ${MOUNT_POINT}
if [ \$? -ne 0 ]; then
    echo "WARNING: Unmount failed....."
fi
echo "Finished on \$(hostname)"
exit 0
EOF
    if [ $? -eq 0 ]; then
        echo "SUCCESS: ${REMOTE_HOST} backup completed."
    else
        echo "FAILED: ${REMOTE_HOST} backup failed."
    fi

    echo ""
done
echo "All servers processed...................."
--------------------------------------------------------------------------------------------------------------

SAVEVG Backup:
savevg is an AIX command that creates a backup of a volume group (VG), including all logical volumes and data in that VG. Common uses include:
  • Backing up entire volume groups before making changes.
  • Migrating volume groups.
  • Disaster recovery.

Basic syntax:
# savevg -f <backup_file_path> <vgname>
-f <backup_file_path>: Specifies the path and filename where the VG backup will be saved.
<vgname>: Name of the volume group you want to back up.
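Putting the syntax together with its restore counterpart, restvg; a sketch (the VG name nfs_vg and disk hdisk3 are examples, and restvg overwrites the target disk):

```shell
VG=nfs_vg   # example data volume group
# -i regenerates the <vgname>.data file before the backup
savevg -i -f /backup/${VG}_$(date +%Y%m%d).savevg "$VG"

# Restore onto a target disk later (destructive on that disk):
# restvg -f /backup/nfs_vg_20251015.savevg hdisk3
```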

Tar Backup:
tar (tape archive) bundles multiple files/directories into a single archive file.
Often used with compression (gzip or bzip2) to save space.

Basic tar backup command
To create a backup archive of a directory, for example /home:
# tar -cvf /backup/home_backup_$(date +%Y%m%d).tar /home
-c = create a new archive
-v = verbose (lists files as they're archived)
-f = specifies the filename of the archive

Compressing the tar archive with gzip
# tar -czvf /backup/home_backup_$(date +%Y%m%d).tar.gz /home
-z = compress the archive using gzip
Compressing the tar archive with bzip2 (better compression)
# tar -cjvf /backup/home_backup_$(date +%Y%m%d).tar.bz2 /home
-j = compress using bzip2
Note: the -z and -j flags require GNU tar; native AIX tar does not support them, so pipe through gzip or bzip2 instead (e.g., tar -cvf - /home | gzip > backup.tar.gz).

Extracting from a tar archive

Without compression:
# tar -xvf archive.tar
With gzip compression:
# tar -xzvf archive.tar.gz
With bzip2 compression:
# tar -xjvf archive.tar.bz2

Example: Backup /etc directory to compressed archive
# tar -czvf /backup/etc_backup_$(date +%Y%m%d).tar.gz /etc
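Before relying on any archive, list its contents with -t to confirm it is readable. A self-contained sketch using a throwaway directory:

```shell
# Create a small test archive, then verify it can be listed back
mkdir -p /tmp/tar_demo && echo "hello" > /tmp/tar_demo/file.txt
tar -czf /tmp/demo.tar.gz -C /tmp tar_demo
tar -tzf /tmp/demo.tar.gz   # lists tar_demo/ and tar_demo/file.txt
```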

IBM NIM

Network Installation Manager (NIM) is an IBM tool designed to automate the installation, configuration, and maintenance of AIX operating systems across multiple machines over a network.
  • Centralized management of all AIX installations
  • Standardized deployment of operating systems, updates, and patches
  • Automated system backups and recovery
  • Support for both diskless and disk-based clients
NIM Master:
The NIM Master is the central hub of the entire setup. It stores and manages all resources needed for client installations or maintenance. Key resources include:
  • LPP_SOURCE – The AIX installation files and updates.
  • SPOT – A bootable temporary environment for network installations.
  • MKSYSB – Full system backup images of clients.
  • CONFIG / SCRIPT – Automation scripts and configuration files to standardize client setups.
The Master communicates with clients primarily over TCP port 475, which is reserved for NIM protocol operations. This ensures commands, coordination, and status updates flow reliably between Master and Client.

Network Services:
Network services facilitate client booting and resource access:
  • BOOTP/DHCP – Assigns IP addresses and provides boot parameters to clients. Diskless and disk-based clients both request their network configuration from the Master at startup.
  • TFTP (Trivial File Transfer Protocol) – Transfers the SPOT boot image from the NIM Master to clients. This happens during the network boot phase.
  • NFS (Network File System) – Allows clients to mount NIM resources like LPP_SOURCE, MKSYSB, CONFIG, and SCRIPT without storing them locally.
NIM Clients:
There are two types of clients in a NIM environment:
  • Diskless Clients – Boot entirely over the network without using a local disk. They rely on SPOT and NFS resources to run the OS.
  • Disk-based Clients – Standard AIX systems that use local disks but still boot over the network for installation or updates.
The client workflow follows this pattern:
  • Power-On / Network Boot – Client sends a BOOTP/DHCP request.
  • Boot Image Transfer – TFTP downloads SPOT from the NIM Master.
  • Resource Mounting – NFS mounts resources needed for installation or updates.
  • Installation / Restore / Update – Master coordinates the process over TCP 475.
  • Reboot – Once completed, the client boots from its local disk (if disk-based) with a fully configured AIX system.
Data Flow:
  • BOOTP/DHCP → IP and boot info assigned
  • TFTP → SPOT boot image sent to client
  • NFS → Resources mounted and accessed for installation
  • TCP 475 → NIM commands, status updates, and session management

The NIM installation process follows these steps:
  • Client Power-On – Client broadcasts BOOTP/DHCP request
  • IP & Boot Info Assignment – NIM Master responds with IP configuration and SPOT location
  • Boot Image Transfer – Client downloads SPOT image via TFTP
  • Resource Mounting – Client mounts LPP_SOURCE, MKSYSB, CONFIG, SCRIPT via NFS
  • Installation / Maintenance – NIM Master coordinates installation or restore over TCP port 475
  • Client Reboot – Client boots from local disk as a fully configured AIX system
Implementing NIM provides several advantages:
  • Centralized Management: Manage all AIX systems from a single NIM Master
  • Automation: Eliminate manual installations and configurations
  • Scalability: Deploy OS across dozens or hundreds of systems simultaneously
  • System Backup & Restore: Use MKSYSB images for fast disaster recovery
  • Consistency: Standardized resources minimize configuration drift
  • Faster Deployment: Network-based booting speeds up installations
  • Reduced Human Error: Scripted deployments reduce mistakes
  • Flexible Updates: Apply patches and updates centrally
  • Support for Diskless Clients: Ideal for test or thin-client environments
  • Cost Efficiency: Reduce manual labor and installation media costs
Setting Up NIM Master

Prerequisites
  • Hardware: IBM Power System (Power7/8/9/10), 4–8 GB RAM, 20 GB disk
  • OS: Supported AIX version (7.1, 7.2, 7.3)
  • NIM Packages: bos.sysmgt.nim.master, bos.sysmgt.nim.spot, bos.sysmgt.nim.client
  • Network: Static IP, NFS configured, BOOTP/DHCP ready
  • Time Synchronization: NTP or chronyd recommended
Installation Steps
Install AIX on the system designated as NIM Master

Install required NIM packages:
# installp -agXd /mnt/lpp_7200-04-02/installp/ppc bos.sysmgt.nim.master bos.sysmgt.nim.spot bos.sysmgt.nim.client

Initialize NIM Master:
# smitty nim --> Configure the NIM Environment --> Advanced Configuration -->  Initialize the NIM Master Only

Configure NFS exports:
# startsrc -g nfs
# lssrc -g nfs

Subsystem         Group            PID          Status
biod             nfs              5112226      active
rpc.lockd        nfs              4325836      active
nfsd             nfs              6750656      active
rpc.mountd       nfs              6554006      active
rpc.statd        nfs              7209330      active
nfsrgyd          nfs                           inoperative
gssd             nfs                           inoperative

# vi /etc/exports
/export/lpp_source -rw,anon=0
/export/spot -rw,anon=0
# exportfs -ua
# exportfs -va
Configure BOOTP/DHCP as needed
Verify TCP Port 475 is open and NIM daemons are running
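The port and daemon checks in the last step can be scripted; a sketch (nimesis is the NIM master daemon under SRC):

```shell
PORT=475                        # reserved NIM protocol port
lssrc -s nimesis                # NIM master daemon should be active
netstat -an | grep -w "$PORT"   # confirm the port is listening
grep -w nim /etc/services       # nim 475/tcp should be registered
```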

Defining NIM Resources
NIM relies on resources to manage clients:
  • LPP_SOURCE: Installation files and patches
  • SPOT: Bootable client environment
  • MKSYSB: System backup image
  • CONFIG/SCRIPT: Configuration and automation scripts
Example: Define LPP_SOURCE from DVD 1
# nim -o define -t lpp_source -a server=master -a source=/mnt/aix_7200-04-02-2027_1of2_072020.iso -a location=/export/lpp_source/lpp_7200-04-02 -a packages=all lpp_7200-04-02

Add Images from DVD 2
# nim -o update -a source=/mnt/aix_7200-04-02-2027_2of2_072020.iso -a packages=all lpp_7200-04-02

Create a SPOT from LPP_SOURCE:
# nim -o define -t spot -a server=master -a source=lpp_7200-04-02 -a location=/export/spot spot_7200-04-02

Client Operations

Install a client using SPOT & LPP_SOURCE:
# nim -o bos_inst -a spot=spot_7200-04-02 -a lpp_source=lpp_7200-04-02 <client>

Install from MKSYSB backup:
# nim -o bos_inst -a source=mksysb -a spot=spot_7200-04-02 -a mksysb=mksysb_backup <client>

Reset a client:
# nim -F -o reset <client>

NIM also supports advanced tasks like alternate disk migration and spot customization for patching or updates.
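Client and resource state can be inspected with lsnim between operations; a sketch (the client name aixclient1 is hypothetical):

```shell
CLIENT=aixclient1      # hypothetical NIM client machine name
lsnim -l "$CLIENT"     # full attribute listing, including Cstate
lsnim -t standalone    # list all standalone client machines
lsnim -t lpp_source    # list defined LPP_SOURCE resources
```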

Updating LPP Source and SPOT

To update an existing LPP source with a new TL/SP:
Example:
# nim -o update -a packages=all -a source=/aix/AIX_v7.3_Install_7300-03-00-2446_LCD8299301.iso lpp_7300-03-00

Customize SPOT for alternative disk installation:
# nim -o cust -a lpp_source=lpp_7300-03-00 -a filesets=bos.alt_disk_install.rte spot_7300-03-00

# nim -o cust -a filesets=bos.alt_disk_install.boot_images -a lpp_source=lpp_7300-03-00 spot_7300-03-00

PowerHA Cluster Manually Startup

The script below starts a PowerHA cluster's resources manually, including steps to:

  1. Check cluster IP

  2. Check network interface and assign alias if needed

  3. Vary on cluster volume groups (VGs)

  4. Mount the filesystems

Here is the script:
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
#!/bin/ksh

echo "Check cluster IP Address......................................."
cltopinfo

echo "Check Network Interfaces......................................."
ifconfig -a

# ksh's read has no bash-style -p prompt option, so print the prompt first
echo "Please Enter the NIC card, Virtual IP Address & Subnet Mask (space separated): "
read niccard virtualip subnetmsk

echo "NIC Card: $niccard"
echo "Virtual IP: $virtualip"
echo "Subnet Mask: $subnetmsk"

echo "Adding Alias IP........................................................"

ifconfig $niccard alias $virtualip netmask $subnetmsk up
if [ $? -ne 0 ]; then
    echo "Failed to add alias IP on $niccard"
    exit 1
fi

echo "Varyon cluster volume groups and Mount Filesystems.............."

# Extract volume groups related to cluster resources

vgs=$(clshowres | grep "Volume Group" | grep "vg" | awk '{print $NF}')
if [ -z "$vgs" ]; then
    echo "No volume groups found in cluster resources"
    exit 1
fi

for vg in $vgs
do
    echo "Varyon volume group: $vg"
    varyonvg -O $vg
    if [ $? -ne 0 ]; then
        echo "Failed to varyon volume group $vg"
        exit 1
    fi

    # List jfs2 filesystems in this VG (exclude jfs2log)
    FS_LIST=$(lsvg -l $vg | awk '/jfs2/ && !/jfs2log/ {print $7}')
    if [ -z "$FS_LIST" ]; then
        echo "No JFS2 filesystems found in volume group $vg"
        continue
    fi

    for fs in $FS_LIST
    do
        echo "Mounting filesystem: $fs"
        mount $fs
        if [ $? -ne 0 ]; then
            echo "Failed to mount filesystem $fs"
            exit 1
        fi
    done
done

echo "Mount any remaining filesystems from /etc/filesystems......."
mount -a
if [ $? -ne 0 ]; then
    echo "Failed to mount filesystems"
    exit 1
fi

echo "PowerHA Cluster Manual Start Script Completed..............."


------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
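After the manual start completes, a few quick checks confirm the node is actually serving the resources; a sketch (nfs_vg is the example VG used earlier in this document):

```shell
EXPECTED_VG=nfs_vg                  # example cluster VG from earlier sections
lsvg -o | grep -w "$EXPECTED_VG"    # it should now appear among varied-on VGs
df -g                               # mounted filesystems (sizes in GB)
netstat -in                         # interfaces, including the alias service IP
```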