[68] We've truncated the leading .iso.org.dod.internet.private.enterprises.hp.nm.system.general from the walk results to save space.
Walking fileSystemDir on the server spruce returns the following:

    [root][nms] /opt/OV/local/bin/disk_space> snmpwalk spruce \
    .1.3.6.1.4.1.11.2.3.1.2.2.1.10
    fileSystem.fileSystemTable.fileSystemEntry.fileSystemDir.14680064.1 : DISPLAY STRING- (ascii):  /
    fileSystem.fileSystemTable.fileSystemEntry.fileSystemDir.14680067.1 : DISPLAY STRING- (ascii):  /var
    fileSystem.fileSystemTable.fileSystemEntry.fileSystemDir.14680068.1 : DISPLAY STRING- (ascii):  /export
    fileSystem.fileSystemTable.fileSystemEntry.fileSystemDir.14680069.1 : DISPLAY STRING- (ascii):  /opt
    fileSystem.fileSystemTable.fileSystemEntry.fileSystemDir.14680070.1 : DISPLAY STRING- (ascii):  /usr
    fileSystem.fileSystemTable.fileSystemEntry.fileSystemDir.41156608.1 : DISPLAY STRING- (ascii):  /proc
    fileSystem.fileSystemTable.fileSystemEntry.fileSystemDir.41680896.1 : DISPLAY STRING- (ascii):  /dev/fd
    fileSystem.fileSystemTable.fileSystemEntry.fileSystemDir.42991617.1 : DISPLAY STRING- (ascii):  /net
    fileSystem.fileSystemTable.fileSystemEntry.fileSystemDir.42991618.1 : DISPLAY STRING- (ascii):  /home
    fileSystem.fileSystemTable.fileSystemEntry.fileSystemDir.42991619.1 : DISPLAY STRING- (ascii):  /xfn

Let's think about how we'd write a program that checks for available disk space. At first glance, it looks like this will be easy. But this table contains a number of objects that aren't filesystems in the normal sense; /proc, for example, provides access to the processes running on the system and doesn't represent storage. This raises problems if we start polling for free blocks: /proc isn't going to have any free blocks, and /dev/fd, which represents a floppy disk, will have free blocks only if a disk happens to be in the drive. You'd expect /home to behave like a normal filesystem, but on this server it's automounted, which means that its behavior is unpredictable; if it's not in use, it might not be mounted. Therefore, if we polled for free blocks using the fileSystem.fileSystemTable.fileSystemEntry.fileSystemBavail object, the last five instances might return 0 under normal conditions. So the results we'd get from polling all the entries in the filesystem table aren't meaningful without further interpretation.

At a minimum, we need to figure out which filesystems are important to us and which aren't. This is probably going to require being clever about the instance numbers. When I discovered this problem, I noticed that all the filesystems I wanted to check happened to have instance numbers with the same leading digits: for example, fileSystemDir.14680064.1, fileSystemDir.14680067.1, fileSystemDir.14680068.1, etc. That observation proved to be less useful than it seemed. With time, I learned that not only do other servers have different leading instance numbers, but that on any server the instance numbers could change. Even if the instance number changes, though, the leading instance digits seem to stay the same for all disks or filesystems of the same type. For example, disk arrays might have instance numbers like fileSystemDir.12312310.1, fileSystemDir.12312311.1, fileSystemDir.12312312.1, and so on. Your internal disks might have instance numbers like fileSystemDir.12388817.1, fileSystemDir.12388818.1, fileSystemDir.12388819.1, and so on.

So, working with the instance numbers is possible, but painful; there is still nothing static that can be easily polled. There's no easy way to say "Give me the statistics for all the local filesystems," or even "Give me the statistics for /usr." I was forced to write a program that would do a fair amount of instance-number processing, making guesses based on the behavior I observed. I had to use snmpwalk to figure out the instance numbers for the filesystems I cared about before doing anything more interesting. By comparing the initial digits of the instance numbers, I was able to figure out which filesystems were local, which were networked, and which were "special purpose" (like /proc).
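The prefix comparison described above can be sketched in a few lines. This is an illustration in Python rather than the Perl used for the script itself; the function names group_by_prefix and same_type_as_root are hypothetical, and the data is copied from the walk of spruce. It assumes, as the text observes, that filesystems of the same type share their leading instance digits.

```python
from collections import defaultdict

# Instance numbers and mount points copied from the snmpwalk of
# fileSystemDir on spruce.
WALK = {
    "14680064.1": "/",
    "14680067.1": "/var",
    "14680068.1": "/export",
    "14680069.1": "/opt",
    "14680070.1": "/usr",
    "41156608.1": "/proc",
    "41680896.1": "/dev/fd",
    "42991617.1": "/net",
    "42991618.1": "/home",
    "42991619.1": "/xfn",
}


def group_by_prefix(walk, digits=5):
    """Group mount points by the first `digits` digits of the instance number."""
    groups = defaultdict(list)
    for instance, mount in walk.items():
        groups[instance[:digits]].append(mount)
    return dict(groups)


def same_type_as_root(walk, digits=5):
    """Return the mount points whose instance prefix matches that of "/"."""
    root_instance = next(i for i, m in walk.items() if m == "/")
    prefix = root_instance[:digits]
    return [m for i, m in walk.items() if i.startswith(prefix)]


if __name__ == "__main__":
    for prefix, mounts in group_by_prefix(WALK).items():
        print(prefix, mounts)
    print(same_type_as_root(WALK))
```

Using "/" as the reference entry is a design choice: the root filesystem is always mounted and always local, so its prefix is a reasonable stand-in for "the local disks" on a server whose instance numbers you haven't seen before.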
Here's the script that resulted, polling.pl:

    #!/usr/local/bin/perl
    # filename: polling.pl
    # options:
    #    -min n     : send trap if less than n 1024-byte blocks free
    #    -table f   : table of servers to watch (defaults to ./default)
    #    -server s  : specifies a single server to poll
    #    -inst n    : number of leading instance-number digits to compare
    #    -debug n   : debug level

    $|++;
    $SNMPWALK_LOC  = "/opt/OV/bin/snmpwalk -r 5";
    $SNMPGET_LOC   = "/opt/OV/bin/snmpget";
    $HOME_LOC      = "/opt/OV/local/bin/disk_space";
    $LOCK_FILE_LOC = "$HOME_LOC/lock_files";
    $GREP_LOC      = "/bin/grep";
    $TOUCH_LOC     = "/bin/touch";
    $PING_LOC      = "/usr/sbin/ping";    # Ping Location
    $PING_TIMEOUT  = 7;                   # Seconds to wait for a ping

    $MIB_C     = ".1.3.6.1.4.1.11.2.3.1.2.2.1.6";    # fileSystemBavail
    $MIB_BSIZE = ".1.3.6.1.4.1.11.2.3.1.2.2.1.7";    # fileSystemBsize
    $MIB_DIR   = ".1.3.6.1.4.1.11.2.3.1.2.2.1.10";   # fileSystemDir

    while ($ARGV[0] =~ /^-/)
    {
        if    ($ARGV[0] eq "-min")    { shift; $MIN = $ARGV[0]; }  # In 1024 blocks
        elsif ($ARGV[0] eq "-table")  { shift; $TABLE = $ARGV[0]; }
        elsif ($ARGV[0] eq "-server") { shift; $SERVER = $ARGV[0]; }
        elsif ($ARGV[0] eq "-inst")   { shift; $INST_LENGTH = $ARGV[0]; }
        elsif ($ARGV[0] eq "-debug")  { shift; $DEBUG = $ARGV[0]; }
        shift;
    }

    #################################################################
    ##########################  Begin Main  #########################
    #################################################################

    $ALLSERVERS  = 1 unless ($SERVER);
    $INST_LENGTH = 5 unless ($INST_LENGTH);
    $TABLE       = "default" unless ($TABLE);

    open(TABLE,"$HOME_LOC/$TABLE") || die "Can't Open File $TABLE";

    while($LINE=<TABLE>)
    {
        if ($LINE ne "\n")
        {
            chomp $LINE;
            ($HOST,$IGNORE1,$IGNORE2,$IGNORE3) = split(/\:/,$LINE);

            if (&ping_server_bad("$HOST"))
            {
                warn "Can't Ping Server :$HOST:" unless (!($DEBUG));
            }
            else
            {
                &find_inst;

                if ($DEBUG > 99)
                {
                    print "HOST:$HOST: IGNORE1 :$IGNORE1: IGNORE2 :$IGNORE2: IGNORE3 :$IGNORE3:\n";
                    print "Running :$SNMPWALK_LOC $HOST $MIB_C \| $GREP_LOC \.$GINST:\n";
                }

                # If we don't have anything to ignore, set the pattern
                # to something that we can never match.
                $IGNORE1 = "C1ANT5MAT9CHT4HIS"     unless ($IGNORE1);
                $IGNORE2 = "CA2N4T6M8A1T3C5H7THIS" unless ($IGNORE2);
                $IGNORE3 = "CAN3TMA7TCH2THI6S"     unless ($IGNORE3);

                if (($SERVER eq "$HOST") || ($ALLSERVERS))
                {
                    open(WALKER,"$SNMPWALK_LOC $HOST $MIB_C \| $GREP_LOC \.$GINST |")
                        || die "Can't Walk $HOST $MIB_C\n";

                    while($WLINE=<WALKER>)
                    {
                        chomp $WLINE;
                        ($MIB,$TYPE,$VALUE) = split(/\:/,$WLINE);
                        $MIB =~ s/\s+//g;
                        $MIB =~ /(\d+\.\d+)$/;
                        $INST = $1;

                        # Look up the mount point for this instance
                        open(SNMPGET,"$SNMPGET_LOC $HOST $MIB_DIR.$INST |");
                        while($DLINE=<SNMPGET>)
                        {
                            ($NULL,$NULL,$DNAME) = split(/\:/,$DLINE);
                        }
                        $DNAME =~ s/\s+//g;
                        close SNMPGET;

                        # Look up the block size for this instance
                        open(SNMPGET,"$SNMPGET_LOC $HOST $MIB_BSIZE.$INST |");
                        while($BLINE=<SNMPGET>)
                        {
                            ($NULL,$NULL,$BSIZE) = split(/\:/,$BLINE);
                        }
                        close SNMPGET;
                        $BSIZE =~ s/\s+//g;

                        $LOCK_RES = &inst_found;
                        $LOCK_RES = "\[ $LOCK_RES \]";
                        print "LOCK_RES :$LOCK_RES:\n" unless ($DEBUG < 99);

                        $VALUE = $VALUE * $BSIZE / 1024;   # Put it in 1024 blocks

                        if (($DNAME =~ /.*$IGNORE1.*/) ||
                            ($DNAME =~ /.*$IGNORE2.*/) ||
                            ($DNAME =~ /.*$IGNORE3.*/))
                        {
                            $DNAME = "$DNAME (ignored)";
                        }
                        else
                        {
                            if (($VALUE <= $MIN) && ($LOCK_RES eq "\[ 0 \]"))
                            {
                                &write_lock;
                                &send_snmp_trap(0);
                            }
                            elsif (($VALUE > $MIN) && ($LOCK_RES eq "\[ 1 \]"))
                            {
                                &remove_lock;
                                &send_snmp_trap(1);
                            }
                        }

                        $VALUE = $VALUE / $BSIZE * 1024;   # Display it as the
                                                           # original block size
                        write unless (!($DEBUG));
                    } # end while($WLINE=<WALKER>)
                } # end if (($SERVER eq "$HOST") || ($ALLSERVERS))
            } # end else from if (&ping_server_bad("$HOST"))
        } # end if ($LINE ne "\n")
    } # end while($LINE=<TABLE>)

    #################################################################
    ######################  Begin SubRoutines  ######################
    #################################################################

    format STDOUT_TOP =
    Server    MountPoint       BlocksLeft   BlockSize   MIB       LockFile
    --------- ---------------- ------------ ----------- --------- ----------
    .

    format STDOUT =
    @<<<<<<<< @<<<<<<<<<<<<<<< @<<<<<<<<<<< @<<<<<<<<<< @<<<<<<<< @<<<<<<<<<
    $HOST,    $DNAME,          $VALUE,      $BSIZE,     $INST,    $LOCK_RES
    .

    sub inst_found
    {
        if (-e "$LOCK_FILE_LOC/$HOST\.$INST") { return 1; }
        else { return 0; }
    }

    sub remove_lock
    {
        if ($DEBUG > 99) { print "Removing Lockfile $LOCK_FILE_LOC/$HOST\.$INST\n"; }
        unlink "$LOCK_FILE_LOC/$HOST\.$INST";
    }

    sub write_lock
    {
        if ($DEBUG > 99) { print "Writing Lockfile $TOUCH_LOC $LOCK_FILE_LOC/$HOST\.$INST\n"; }
        system "$TOUCH_LOC $LOCK_FILE_LOC/$HOST\.$INST";
    }

    #################################################################
    ## send_snmp_trap ##
    ####################
    ##
    # This subroutine allows you to send diff traps depending on the
    # passed parm and gives you a chance to send both good and bad
    # traps.
    #
    # $1 - integer - This will be added to the specific event ID.
    #
    # If we created two traps:
    # 2789.2500.0.1000 = Major
    # 2789.2500.0.1001 = Good
    #
    # If we declare:
    # $SNMP_SPECIFIC_TRAP = "1000";
    #
    # We could send the 1st by using:
    # send_snmp_trap(0);   # Here is the math (1000 + 0 = 1000)
    # And to send the second one:
    # send_snmp_trap(1);   # Here is the math (1000 + 1 = 1001)
    #
    # This way you could set up multiple traps with diff errors using
    # the same function for all.
    #
    ##
    #################################################################

    sub send_snmp_trap
    {
        $TOTAL_TRAPS_CREATED = 2;   # Let's do some checking/reminding
                                    # here. This number should be the
                                    # total number of traps that you
                                    # created on the nms.

        $SNMP_ENTERPRISE_ID = ".1.3.6.1.4.1.2789.2500";
        $SNMP_SPECIFIC_TRAP = "1500";
        $PASSED_PARM        = $_[0];
        $SNMP_SPECIFIC_TRAP += $PASSED_PARM;

        $SNMP_TRAP_LOC  = "/opt/OV/bin/snmptrap";
        $SNMP_COMM_NAME = "public";
        $SNMP_TRAP_HOST = "nms";
        $SNMP_GEN_TRAP  = "6";

        chop($SNMP_TIME_STAMP = "1" . `date +%H%S`);

        $SNMP_EVENT_IDENT_ONE   = ".1.3.6.1.4.1.2789.2500.$SNMP_SPECIFIC_TRAP.1";
        $SNMP_EVENT_VTYPE_ONE   = "octetstringascii";
        $SNMP_EVENT_VAR_ONE     = "$DNAME";

        $SNMP_EVENT_IDENT_TWO   = ".1.3.6.1.4.1.2789.2500.$SNMP_SPECIFIC_TRAP.2";
        $SNMP_EVENT_VTYPE_TWO   = "integer";
        $SNMP_EVENT_VAR_TWO     = "$VALUE";

        $SNMP_EVENT_IDENT_THREE = ".1.3.6.1.4.1.2789.2500.$SNMP_SPECIFIC_TRAP.3";
        $SNMP_EVENT_VTYPE_THREE = "integer";
        $SNMP_EVENT_VAR_THREE   = "$BSIZE";

        $SNMP_EVENT_IDENT_FOUR  = ".1.3.6.1.4.1.2789.2500.$SNMP_SPECIFIC_TRAP.4";
        $SNMP_EVENT_VTYPE_FOUR  = "octetstringascii";
        $SNMP_EVENT_VAR_FOUR    = "$INST";

        $SNMP_EVENT_IDENT_FIVE  = ".1.3.6.1.4.1.2789.2500.$SNMP_SPECIFIC_TRAP.5";
        $SNMP_EVENT_VTYPE_FIVE  = "integer";
        $SNMP_EVENT_VAR_FIVE    = "$MIN";

        $SNMP_TRAP = "$SNMP_TRAP_LOC \-c $SNMP_COMM_NAME $SNMP_TRAP_HOST " .
            "$SNMP_ENTERPRISE_ID \"$HOST\" $SNMP_GEN_TRAP $SNMP_SPECIFIC_TRAP " .
            "$SNMP_TIME_STAMP " .
            "$SNMP_EVENT_IDENT_ONE $SNMP_EVENT_VTYPE_ONE \"$SNMP_EVENT_VAR_ONE\" " .
            "$SNMP_EVENT_IDENT_TWO $SNMP_EVENT_VTYPE_TWO \"$SNMP_EVENT_VAR_TWO\" " .
            "$SNMP_EVENT_IDENT_THREE $SNMP_EVENT_VTYPE_THREE \"$SNMP_EVENT_VAR_THREE\" " .
            "$SNMP_EVENT_IDENT_FOUR $SNMP_EVENT_VTYPE_FOUR \"$SNMP_EVENT_VAR_FOUR\" " .
            "$SNMP_EVENT_IDENT_FIVE $SNMP_EVENT_VTYPE_FIVE \"$SNMP_EVENT_VAR_FIVE\"";

        if (!($PASSED_PARM < $TOTAL_TRAPS_CREATED))
        {
            die "ERROR SNMPTrap with a Specific Number \> $TOTAL_TRAPS_CREATED\nSNMP_TRAP:$SNMP_TRAP:\n";
        }

        # Sending a trap using Net-SNMP
        #
        #system "/usr/local/bin/snmptrap $SNMP_TRAP_HOST $SNMP_COMM_NAME
        #$SNMP_ENTERPRISE_ID '' $SNMP_GEN_TRAP $SNMP_SPECIFIC_TRAP ''
        #$SNMP_EVENT_IDENT_ONE s \"$SNMP_EVENT_VAR_ONE\"
        #$SNMP_EVENT_IDENT_TWO i \"$SNMP_EVENT_VAR_TWO\"
        #$SNMP_EVENT_IDENT_THREE i \"$SNMP_EVENT_VAR_THREE\"
        #$SNMP_EVENT_IDENT_FOUR s \"$SNMP_EVENT_VAR_FOUR\"
        #$SNMP_EVENT_IDENT_FIVE i \"$SNMP_EVENT_VAR_FIVE\"";

        # Sending a trap using Perl
        #
        #use SNMP_util "0.54";  # This will load the BER and SNMP_Session for us
        #snmptrap("$SNMP_COMM_NAME\@$SNMP_TRAP_HOST:162", "$SNMP_ENTERPRISE_ID",
        #mylocalhostname, $SNMP_GEN_TRAP, $SNMP_SPECIFIC_TRAP,
        #"$SNMP_EVENT_IDENT_ONE", "string", "$SNMP_EVENT_VAR_ONE",
        #"$SNMP_EVENT_IDENT_TWO", "int", "$SNMP_EVENT_VAR_TWO",
        #"$SNMP_EVENT_IDENT_THREE", "int", "$SNMP_EVENT_VAR_THREE",
        #"$SNMP_EVENT_IDENT_FOUR", "string", "$SNMP_EVENT_VAR_FOUR",
        #"$SNMP_EVENT_IDENT_FIVE", "int", "$SNMP_EVENT_VAR_FIVE");

        # Sending a trap using OpenView's snmptrap (using VARs from above).
        # system returns the command's exit status; nonzero means failure.
        #
        $SEND_SNMP_TRAP = system $SNMP_TRAP;

        if ($SEND_SNMP_TRAP)
        {
            print "ERROR Running SnmpTrap Result ";
            print ":$SEND_SNMP_TRAP: :$SNMP_TRAP:\n"
        }
    }

    sub find_inst
    {
        open(SNMPWALK2,"$SNMPWALK_LOC $HOST $MIB_DIR |") || die "Can't Find Inst for $HOST\n";

        while($DLINE=<SNMPWALK2>)
        {
            chomp $DLINE;
            ($DIRTY_INST,$NULL,$DIRTY_NAME) = split(/\:/,$DLINE);
            $DIRTY_NAME =~ s/\s+//g;   # Lose the whitespace, folks!
            print "DIRTY_INST :$DIRTY_INST:\nDIRTY_NAME :$DIRTY_NAME:\n"
                unless (!($DEBUG>99));

            if ($DIRTY_NAME eq "/")
            {
                $DIRTY_INST =~ /fileSystemDir\.(\d*)\.1/;
                $GINST = $1;
                $LENGTH = (length($GINST) - $INST_LENGTH);
                while ($LENGTH--) { chop $GINST; }
                close SNMPWALK2;
                print "Found Inst DIRTY_INST :$DIRTY_INST: DIRTY_NAME :$DIRTY_NAME: GINST :$GINST:\n"
                    unless (!($DEBUG > 99));
                return 0;
            }
        }
        close SNMPWALK2;
        die "Can't Find Inst for HOST :$HOST:";
    }

    sub ping_server_bad
    {
        local $SERVER = $_[0];
        $RES = system "$PING_LOC $SERVER $PING_TIMEOUT \> /dev/null";
        print "Res from Ping :$RES: \- :$PING_LOC $SERVER:\n" unless (!($DEBUG));
        return $RES;
    }

The script contains a handful of useful features: it reads the list of servers to poll from an external table file, it can ignore up to three filesystems on each server, and it uses lockfiles to keep from sending duplicate traps.[69]
[69]There have been a few times that we have missed the fact that a system has filled up because a trap was lost during transmission. Using cron, we frequently delete everything in the lock directory. This resubmits the entries, if any, at that time.
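The lockfile logic in the script amounts to a small two-state machine per filesystem. The sketch below is an illustration only, written in Python rather than the script's Perl; the names to_1k_blocks, decide, and simulate are hypothetical, and it uses integer division where the script uses Perl's floating-point division. The clear_lock_before argument stands in for cron deleting a lockfile between polls.

```python
def to_1k_blocks(blocks_free, block_size):
    """Convert a free-block count in native blocks to 1024-byte blocks."""
    return blocks_free * block_size // 1024


def decide(free_1k, minimum, locked):
    """Return (trap, locked_after) for one poll of one filesystem.

    trap is "low", "normal", or None when nothing should be sent.
    """
    if free_1k <= minimum and not locked:
        return "low", True        # write the lockfile, send the "low" trap
    if free_1k > minimum and locked:
        return "normal", False    # remove the lockfile, send the "normal" trap
    return None, locked           # no state change, so no trap


def simulate(samples, minimum, clear_lock_before=()):
    """Run decide() over successive polls and collect the traps sent."""
    locked = False
    traps = []
    for n, free_1k in enumerate(samples):
        if n in clear_lock_before:
            locked = False        # stands in for cron deleting the lockfile
        trap, locked = decide(free_1k, minimum, locked)
        traps.append(trap)
    return traps


if __name__ == "__main__":
    print(to_1k_blocks(1000, 8192))
    print(simulate([60000, 40000, 30000, 70000], 50000))
    print(simulate([40000, 30000, 30000], 50000, clear_lock_before={2}))
```

In the first simulation the filesystem stays low for two polls but only one "low" trap is sent; in the second, clearing the lock before the third poll causes the "low" trap to be sent again, which is the effect the footnote describes.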
Suppose the default table lists three servers to poll, one per line:

    db_serv0
    db_serv1
    db_serv2

Now we can run the script with the -debug option to show us a table of the results. The following command asks for all filesystems on the server db_serv0 with fewer than 50,000 blocks (50 MB) free:

    $ /opt/OV/local/bin/disk_space/polling.pl -min 50000 -server db_serv0 -debug 1
    Res from Ping :0: - :/usr/sbin/ping db_serv0:
    Server     MountPoint        BlocksLeft BlockSize MIB             LockFile
    ---------- ----------------- ---------- --------- --------------- --------
    db_serv0   /                 207766     1024      38010880.1      [ 0 ]
    db_serv0   /usr              334091     1024      38010886.1      [ 0 ]
    db_serv0   /opt              937538     1024      38010887.1      [ 0 ]
    db_serv0   /var              414964     1024      38010888.1      [ 0 ]
    db_serv0   /db1              324954     1024      38010889.1      [ 0 ]

Notice that we didn't need to specify a table explicitly; because we omitted the -table option, the polling.pl script used the default file, which it looks for in its home directory (/opt/OV/local/bin/disk_space). The -server switch let us limit the test to the server named db_serv0; if we had omitted this option, the script would have checked all servers within the default table. If the free space on any of the filesystems falls under 50,000 1024-byte blocks, the program sends a trap and writes a lockfile with the instance number.

Because SNMP traps use UDP, they are unreliable. This means that some traps may never reach their destination. This could spell disaster: in our situation, we're sending traps to notify a manager that a filesystem is full. We don't want those traps to disappear, especially since we've designed our program so that it doesn't send duplicate notifications. One workaround is to have cron delete some or all of the files in the lock directory. We like to delete everything in the lock directory every hour; this means that we'll get a notification every hour until some free storage appears in the filesystem. Another plausible policy is to delete only the production-server lockfiles. With this policy, we'll get hourly notification about filesystem capacity problems on the servers we care about most; on other machines (e.g., development machines, test machines), we will get only a single notification.

Let's say that the filesystem /db1 is a test system and we don't care if it fills up. We can ignore this filesystem by specifying it in our table. We can list up to three filesystems we would like to ignore after the server name (which must be followed by a ":"):

    db_serv0:db1

Running the polling.pl script again gives these results:

    $ /opt/OV/local/bin/disk_space/polling.pl -min 50000 -server db_serv0 -debug 1
    Res from Ping :0: - :/usr/sbin/ping db_serv0:
    Server     MountPoint        BlocksLeft BlockSize MIB             LockFile
    ---------- ----------------- ---------- --------- --------------- --------
    db_serv0   /                 207766     1024      38010880.1      [ 0 ]
    db_serv0   /usr              334091     1024      38010886.1      [ 0 ]
    db_serv0   /opt              937538     1024      38010887.1      [ 0 ]
    db_serv0   /var              414964     1024      38010888.1      [ 0 ]
    db_serv0   /db1 (ignored)    324954     1024      38010889.1      [ 0 ]

When the /db1 filesystem drops below the minimum disk space, the script will not send any traps or create any lockfiles.

Now let's go beyond experimentation. The following crontab entries run our program twice every hour:

    4,34 * * * * /opt/OV/bin/polling.pl -min 50000
    5,35 * * * * /opt/OV/bin/polling.pl -min 17000 -table stocks_table
    7,37 * * * * /opt/OV/bin/polling.pl -min 25000 -table bonds_table -inst 3

Next we need to define how the traps polling.pl generates should be handled when they arrive at the NMS. Here are the entries in OpenView's trapd.conf file that show how to handle these traps:

    EVENT DiskSpaceLow .1.3.6.1.4.1.2789.2500.0.1500 "Threshold Alarms" Major
    FORMAT Disk Space For FileSystem :$1: Is Low With :$2: 1024 Blocks Left - Current FS Block Size :$3: - Min Threshold :$5: - Inst :$4:
    SDESC
    $1 - octetstringascii   - FileSystem
    $2 - integer            - Current Size
    $3 - integer            - Block Size
    $4 - octetstringascii   - INST
    $5 - integer            - Min Threshold Size
    EDESC
    #
    #
    #
    EVENT DiskSpaceNormal .1.3.6.1.4.1.2789.2500.0.1501 "Threshold Alarms" Normal
    FORMAT Disk Space For FileSystem :$1: Is Normal With :$2: 1024 Blocks Left - Current FS Block Size :$3: - Min Threshold :$5: - Inst :$4:
    SDESC
    $1 - octetstringascii   - FileSystem
    $2 - integer            - Current Size
    $3 - integer            - Block Size
    $4 - octetstringascii   - INST
    $5 - integer            - Min Threshold Size
    EDESC

These entries define two OpenView events: a DiskSpaceLow event that is used when a filesystem's capacity is below the threshold, and a DiskSpaceNormal event. We place both of these in the Threshold Alarms category; the low disk space event has a severity of Major, while the "normal" event has a severity of Normal. If you're using some other package to listen for traps, you'll have to configure it accordingly.
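The event identifiers in trapd.conf follow directly from the values set in send_snmp_trap: the enterprise ID, a 0, and the specific trap number (the base of 1500 plus the parameter passed to the subroutine). The arithmetic can be sketched as follows, in Python for illustration; the function name event_oid is hypothetical, and the constants are copied from the script.

```python
ENTERPRISE_ID = ".1.3.6.1.4.1.2789.2500"   # $SNMP_ENTERPRISE_ID in the script
SPECIFIC_BASE = 1500                       # $SNMP_SPECIFIC_TRAP in the script
TOTAL_TRAPS_CREATED = 2                    # number of events defined on the NMS


def event_oid(offset):
    """Build the trapd.conf event OID for send_snmp_trap(offset)."""
    if not 0 <= offset < TOTAL_TRAPS_CREATED:
        # Mirrors the script's check against $TOTAL_TRAPS_CREATED
        raise ValueError(f"no event defined for offset {offset}")
    return f"{ENTERPRISE_ID}.0.{SPECIFIC_BASE + offset}"


if __name__ == "__main__":
    print("DiskSpaceLow   ", event_oid(0))
    print("DiskSpaceNormal", event_oid(1))
```

If you add a third trap to the script, you would raise TOTAL_TRAPS_CREATED and define a matching EVENT entry whose identifier ends in .0.1502.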
Copyright © 2002 O'Reilly & Associates. All rights reserved.