Tracking Partitioned Server's Aggregated Utilization
Update: June 16, 2008
Bug fix to display AIX adapter statistics on newer AIX 5.3 TLs.
This program currently supports nmon versions 9, 10, and 11. Version support is coming soon.
Update: November 2006
This is the third in a series of tips that illustrate how to automate the collection and display of nmon performance data from multiple servers. This tip extends the capability of the previous tips by adding CPU, memory, and virtual I/O aggregation across partitions residing on the same physical server.
This tip is targeted primarily at micropartitioned systems. However, it may be of broader interest for the following new features:
- More flexibility in the choice of charts (the nmon2rrd utility has been replaced with Perl)
- Displays non-default AIX settings for ease of management
- Displays change control logs for AIX settings and hardware configuration
- Workload Manager: displays absolute CPU utilization by class (useful for micropartitions, where % utilization is meaningless)
- Centralized rrdtool database for easier data extraction, so you can write your own programs (the nmon2rrd tool created a separate rrdtool database for each nmon file)
- Uses less disk space: duplicate databases (daily and long term) have been removed
I originally had two objectives in creating this tool:
- Automate the creation of daily nmon charts
- Aggregate CPU and memory usage across multiple partitions on a micropartitioned server.
The scope grew to include the aggregation of virtual I/O. However, there are some limitations in this approach (such as how to handle multiple VIO servers), so be sure to understand the limitations listed at the bottom of this page. Otherwise, I've found this to be a very useful tool for tracking performance.
Process Overview
This tool organizes and creates charts on a centralized server using nmon data from multiple servers. Each server (standalone, LPAR, micropartition) uses "nmon -f" to collect daily performance data. At the end of the collection period, the nmon file is transferred to a staging directory on a centralized web server. (I leave the details of the data transfer up to you.) On the web server, the "nmon2web.pl" script organizes the data by server and stores it in an "rrdtool" database. It also creates the daily web pages.
I've tried to automate the process. For example, to add a new server, simply put the new nmon file in the web server's staging directory. The "nmon2web.pl" will figure out that this is a new server, and will create the necessary directories, rrdtool databases, etc. It will also add the new server to the web page.
- All Servers: The "nmon" program collects performance data on AIX LPAR's, micropartitions and standalone servers.
- Run "nmon -f" as a "cron" job, which outputs data to a file
- The recommended nmon sample interval is 10-20 minutes.
- There are no restrictions on the sample length, but I recommend 1-24 hours. The one hour size allows you to view performance over the day, but it creates a lot more files.
- The "nmon" output files are forwarded to a staging directory on a central web server.
- If you are using a partitioned server, turn on the "Allow shared processor pool utilization authority" option on all partitions. This is done on the HMC by right-clicking the partition name (not the partition's profile!). Choose "Properties". Choose the "Hardware" tab, then the "Processor and Memory" tab.
- Web Server: The "nmon2web.pl" script processes the nmon files
- Organizes web pages by server's serial number, partition name and type (dedicated|micropartition)
- Automatically adds new servers
- Stores data in a "rrdtool" database.
- Creates daily and long term performance charts for individual servers
- Logs configuration and tuning changes
- Aggregated charts are created dynamically using the "nmon2web.cgi" script
- PC Browser: Point browser to the "index.html" page on the web server
- Daily and long term performance charts
- Aggregated utilization across all partitions on a physical server
- Lists configuration, change logs, and non-default AIX settings
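The automatic handling of new servers described above can be pictured with a small Python sketch. It is not the nmon2web.pl code (which is written in Perl), only an illustration under stated assumptions: that the nmon file carries "AAA,host,..." and "AAA,SerialNumber,..." header records, and that servers are filed under a hypothetical base/serial/host directory layout. The function names and the "standalone" folder name are made up for this example.

```python
import csv
import io
from pathlib import Path


def read_nmon_identity(text):
    """Return (serial, host) taken from the AAA header records of an nmon file.

    Assumes records of the form "AAA,host,<name>" and
    "AAA,SerialNumber,<serial>"; either value is None when absent.
    """
    serial, host = None, None
    for row in csv.reader(io.StringIO(text)):
        if len(row) < 3 or row[0] != "AAA":
            continue
        key = row[1].lower()
        if key == "host":
            host = row[2]
        elif key == "serialnumber":
            serial = row[2]
    return serial, host


def register(base, text):
    """Create the server's directory if needed and return (path, is_new)."""
    serial, host = read_nmon_identity(text)
    if host is None:
        raise ValueError("no AAA,host record found")
    # Hypothetical layout: files without a serial number are filed as
    # standalone servers.
    path = Path(base) / (serial or "standalone") / host
    is_new = not path.exists()
    path.mkdir(parents=True, exist_ok=True)
    return path, is_new
```

A first call for a given serial number and host reports the server as new and creates its directory; later calls for the same server find the directory already in place.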
Installation Steps for Servers
- Install nmon performance monitor tool (V11 is preferred, but V10 will work, V9 will work, sorta)
- Use cron to automate nmon data collection.
# The following cron entry starts nmon at 00:01 every day. The -x option
# records a sample every 15 minutes for 24 hours:
#
01 00 * * * (cd /system_dir/nmon/HOSTNAME; /usr/local/bin/nmon -x)
- Automate upload of the nmon files to web server. (I run ftp as a cron job)
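If you prefer to choose the sample interval and collection period yourself, nmon's -s (seconds between samples) and -c (number of samples) options can be used with -f. The arithmetic is simple but easy to get wrong, so here is a minimal Python sketch that derives the two values. The function name is only illustrative.

```python
def nmon_args(interval_minutes, duration_hours):
    """Return the nmon options for a given sample interval and run length."""
    seconds = interval_minutes * 60
    total = duration_hours * 3600
    if seconds <= 0 or total <= 0 or total % seconds != 0:
        raise ValueError("duration must be a positive multiple of the interval")
    count = total // seconds  # number of samples nmon should take
    return ["-f", "-s", str(seconds), "-c", str(count)]


# A 10 minute interval over 24 hours needs 144 samples.
print(" ".join(["nmon"] + nmon_args(10, 24)))  # nmon -f -s 600 -c 144
```

The same function gives "-f -s 1200 -c 3" for one-hour files sampled every 20 minutes.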
Installation Steps for Web Server
- Comment: My test web server is a Linux on Power micropartition. It should work on AIX web servers as well.
- Install "rrdtool"
- Unpack nmon2web.tar.gz (gzip -dc nmon2web.tar.gz | tar -xvf -)
- Install nmon2web.cgi
- Move nmon2web.cgi to the web server's cgi directory
- Make it executable (chmod a+x nmon2web.cgi)
- Change the $DIRECTORY and $WEB_DIR variables to reflect your installation:
$DIRECTORY="/home/baspence/public_html/nmon";
$WEB_DIR='/~baspence/nmon';
- Install nmon2web.pl
- Move nmon2web.pl to /usr/local/bin (or equivalent)
- Make it executable (chmod a+x nmon2web.pl)
- Customize the directory and database retention variables:
$NMON_DIR="/home/baspence/nmon"; # source of nmon files
$HTTP_DIR="/home/baspence/public_html/nmon"; # Absolute path to index.html
$DB_MONTHS=36; # rrdtool: number of months to retain data before rrd wraps
- Add a cron job to execute this script. This entry runs the script every day at 1 AM:
00 01 * * * (/usr/local/bin/nmon2web.pl)
- Create $HTTP_DIR and $NMON_DIR directories
- Grant write permission on $HTTP_DIR to nmon2web.pl and nmon2web.cgi programs
- Move index.html to $HTTP_DIR (chmod a+r index.html)
Comments
Aggregating CPU and memory is relatively straightforward. However, aggregating virtual I/O and ethernet is more challenging. By default, the nmon2web.cgi program aggregates virtual I/O utilization by summing all vscsi adapters across all partitions (LPAR and micropartition). For ethernet, the program sums en0 traffic only on micropartitions. The problem is that the program could double count vscsi workload, or assume the wrong ethernet interface.
You can specify your virtual scsi and ethernet configuration by creating the file $SYSTEM_DIR/Shared/sharedpool/virtual.cfg. There's a template file in the same directory that explains how to configure it.
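To make the default rules concrete, here is a rough Python sketch of the two sums described above. It is not taken from nmon2web.cgi; the data structure, field names, and sample figures are all invented for illustration.

```python
def aggregate_defaults(partitions):
    """Apply the default aggregation rules to one sample per partition.

    Returns (vscsi_total, en0_total): every vscsi adapter is summed across
    all partitions, while en0 is summed on micropartitions only.
    """
    vscsi_total = 0.0
    en0_total = 0.0
    for part in partitions:
        vscsi_total += sum(
            rate
            for name, rate in part["disk_adapters"].items()
            if name.startswith("vscsi")
        )
        if part["type"] == "micropartition":
            en0_total += part["network"].get("en0", 0.0)
    return vscsi_total, en0_total


# Hypothetical sample data (KB/s), not real measurements.
sample = [
    {"name": "lpar1", "type": "dedicated",
     "disk_adapters": {"vscsi0": 100.0, "scsi0": 10.0},
     "network": {"en0": 40.0}},
    {"name": "micro1", "type": "micropartition",
     "disk_adapters": {"vscsi0": 50.0, "vscsi1": 25.0},
     "network": {"en0": 30.0}},
    {"name": "micro2", "type": "micropartition",
     "disk_adapters": {},
     "network": {"en1": 20.0}},
]
print(aggregate_defaults(sample))  # (175.0, 30.0)
```

In this made-up data, micro2 sends its traffic over en1, so the en0-only rule misses it entirely. That is the kind of wrong-interface case the virtual.cfg file is meant to correct.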
Known Limitations
This program is not backward compatible with Parts 1 & 2. The file systems have been reorganized, the rrdtool databases centralized. The nmon data should be reloaded from scratch.
Adding or deleting servers can cause blank aggregated charts. The underlying "rrdtool" doesn't handle missing data when aggregating data. So if you add or remove a micropartition on a server, you may get blank charts when you try to display aggregated performance over a time period in which that partition's data is missing.
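The blank charts follow from how rrdtool evaluates its RPN expressions: if any operand in a sum is unknown, the result is unknown. A common rrdtool idiom is to replace unknown values with 0 before summing, using a CDEF of the form "x,UN,0,x,IF". The Python sketch below builds such graph arguments for a list of rrd files. Whether nmon2web.cgi does it this way is not stated here; the function name, variable names, and the data source name "cpu" are assumptions for the example.

```python
def aggregate_cdefs(rrd_files, ds_name="cpu", cf="AVERAGE"):
    """Build rrdtool graph arguments that sum one data source across files.

    Unknown samples are replaced with 0 so that a partition with missing
    data does not blank out the whole total.
    """
    if not rrd_files:
        raise ValueError("at least one rrd file is required")
    args = []
    safe_names = []
    for i, path in enumerate(rrd_files):
        # Note: a path containing ':' would need escaping for rrdtool.
        args.append(f"DEF:raw{i}={path}:{ds_name}:{cf}")
        # RPN: if raw is unknown use 0, otherwise use raw.
        args.append(f"CDEF:safe{i}=raw{i},UN,0,raw{i},IF")
        safe_names.append(f"safe{i}")
    # RPN sum of all safe values: a,b,+,c,+ ...
    rpn = safe_names[0]
    for name in safe_names[1:]:
        rpn += f",{name},+"
    args.append(f"CDEF:total={rpn}")
    return args


for arg in aggregate_cdefs(["lpar1.rrd", "micro1.rrd"]):
    print(arg)
```

Keep in mind that treating missing data as 0 understates the total for any period in which a partition was running but not reporting.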
Adding new devices (scsi, fcsi, ethernet) may cause blank graphs. Same reason as above. New devices will have to be added manually to the appropriate rrd file.
Short nmon sampling intervals increase disk requirements on the web server. For example, a 1 minute interval and 3 year data retention (default) used about 500 MB of disk space (reserved at setup time by the rrdtool). I recommend using an nmon sampling interval of 10-20 minutes.
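The effect of the sampling interval on disk space can be estimated with simple arithmetic, since rrdtool stores each value as an 8-byte number and allocates every row up front. The Python sketch below is a back-of-the-envelope estimate only: it assumes 30-day months, one archive at full resolution, and ignores file headers. The count of 40 data sources is purely illustrative and is not a figure from this tool.

```python
BYTES_PER_VALUE = 8  # rrdtool stores each value as a double


def estimate_rrd_bytes(interval_minutes, months, data_sources, archives=1):
    """Rough size of the stored values, ignoring rrd file headers."""
    rows = (months * 30 * 24 * 60) // interval_minutes
    return rows * data_sources * archives * BYTES_PER_VALUE


# 36 months of retention, 40 hypothetical data sources.
for interval in (1, 10, 20):
    size_mb = estimate_rrd_bytes(interval, 36, 40) / 1_000_000
    print(f"{interval:>2} minute interval: about {size_mb:.0f} MB")
```

Because the row count is inversely proportional to the interval, moving from a 1 minute to a 10 minute interval cuts the space needed by a factor of ten.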
Linux: Empty charts for "system calls" (nmon produces negative numbers).
Linux on Power: If running on a micropartitioned system with AIX, the CPU free pool may look 100% busy. The partition is displayed as a standalone server (nmon for pLinux doesn't report the serial number, so the partition doesn't get assigned to a server).
Other: Please report other issues. I have a limited "sandbox" and have not tested every combination of hardware/operating system.
Revision History
First Release: June 2006
July 2006 Bug fix
- Typos: wlmcpu.rdd -> wlmcpu.rrd, $$WEIGHT -> $WEIGHT
- Modified rrd_update
September 2006
- Added support for nmon 11d, 11e. Fixes missing LPAR charts resulting from new fields being added to the nmon output.
November 2, 2006
- Added *.csv files for nmon input (*.nmon and *.csv)
- Modified LPAR charts to include free pool (thanks to David Wong)
- Replace missing data with 0's
- Added support for Linux (limited: you'll notice missing charts, because nmon for Linux doesn't provide the same amount of information as it does for AIX).