Monitoring Network Devices with Nagios

Nagios is a scalable, flexible, and powerful network monitoring solution that pairs well with graphing tools such as Cacti or MRTG. In this post I’ll share templates and configuration files to get you started with monitoring Cisco routers, switches, and security devices.

Once you have a few devices configured using the templates in this post, you’ll be able to quickly scale out your deployment using Python, shell scripts, or, in the worst case, a text editor.

Introduction

This post doesn’t include instructions on how to install Nagios, but instead assumes that you followed the Nagios Quick Start Guide and have a working installation complete with plugins and a functioning web interface.

The templates defined in the steps below will allow you to monitor the following:

  1. Cisco IOS Routers and Switches
    • System UpTime
    • 5 Minute CPU Average
    • BGP Peer Sessions
    • Interface Operational Status (Port-channel, Vlan, Physical)
    • SSH Availability
    • IP SLA ICMP Echo Round Trip Time (RTT)
    • IP SLA ICMP Echo Failures
    • Packet Loss and RTT to a Layer 3 Interface
  2. Cisco Nexus Switches
    • System Uptime
    • 5 Minute CPU Average
    • Interface Operational Status (Port-channel, Vlan, Physical)
    • SSH Availability
    • Packet Loss and RTT to Management Interface
  3. Cisco ASA Appliances and Firewall Service Modules (FWSM)
    • System UpTime
    • 5 Minute CPU Average
    • Interface Operational Status (Physical)
    • SSH Availability
    • Packet Loss and RTT to a Layer 3 Interface
    • Total Current Sessions

Overview

We will be following these 6 steps to get Nagios monitoring your network:

  1. Download the check_bgp.pl plugin (Optional)
  2. Add Command Definitions
  3. Create hostgroups.cfg (Optional)
  4. Create hosts.cfg
  5. Define Services to Monitor
  6. Define Interfaces to Monitor

I also included an extra section showing how you can use awk to help generate your service definitions for those who don’t know how to script.

Deployment Example

My Nagios deployment is monitoring over 900 “services” (interfaces, ports, services, sessions, etc) on 175 network devices in one data center.

Here are a few screen shots of how my deployment looks in Nagios:

Host Group Overview

[Screenshot: Nagios_Hostgroup_Overview]

Host Details for a Core Router

[Screenshot: CR1_Host_Detail]

Service Group Overview

[Screenshot: Nagios_Servicegroup_Overview]

Since a single instance of Nagios is monitoring all 175 network devices in this data center I am using Host Groups and Service Groups (both optional) to help organize things.

Instructions

1.  (Optional) Download the check_bgp.pl plugin

If you’re running BGP and want to monitor your peer sessions, I recommend using the check_bgp.pl plugin from the Nagios Exchange. Download it to your plugins directory (mine is /usr/lib/nagios/plugins) with wget and make it executable.

root@nag001:/usr/lib/nagios/plugins# wget -O check_bgp.pl "http://exchange.nagios.org/components/com_mtree/attachment.php?link_id=1555&cf_id=30"
 --2013-01-30 23:21:26--  http://exchange.nagios.org/components/com_mtree/attachment.php?link_id=1555&cf_id=30
 Resolving exchange.nagios.org... 66.228.58.94
 Connecting to exchange.nagios.org|66.228.58.94|:80... connected.
 HTTP request sent, awaiting response... 200 OK
 Length: 10219 (10.0K) [application/octet-stream]
 Saving to: `check_bgp.pl'

100%[=======================================================================================>] 10,219 --.-K/s in 0.07s

2013-01-30 23:21:27 (153 KB/s) - `check_bgp.pl' saved [10219/10219]

root@nag001:/usr/lib/nagios/plugins# chmod +x check_bgp.pl 

2. Add Command Definitions

Almost all of our monitoring requirements are handled by three plugins: check_snmp, check_bgp.pl, and check_tcp. In order to use them for our service checks we must first create custom Command Definitions in the commands.cfg configuration file. (The PING check in Step 5 relies on the check_ping command that a stock Nagios installation already defines.)

To make these commands as flexible as possible we will include variables that allow arguments to be passed from our hosts.cfg and services.cfg files.

Let’s take the check_snmp plugin as an example. This plugin accepts over 20 different options as explained on its definition page: http://nagiosplugins.org/man/check_snmp

Usage:
check_snmp -H <ip_address> -o <OID> [-w warn_range] [-c crit_range]
[-C community] [-s string] [-r regex] [-R regexi] [-t timeout] [-e retries]
[-l label] [-u units] [-p port-number] [-d delimiter] [-D output-delimiter]
[-m miblist] [-P snmp version] [-L seclevel] [-U secname] [-a authproto]
[-A authpasswd] [-x privproto] [-X privpasswd]

When we use check_snmp to monitor interfaces we will use the -r and -l options. When we use it to monitor IP SLAs and NAT translations we will use the -w and -c options, and each check uses a different OID. To accommodate all of this with just one command definition we will use three Nagios argument macros called $ARG1$, $ARG2$, and $ARG3$:

check_snmp -H $HOSTADDRESS$ -C public -o $ARG1$ $ARG2$ $ARG3$

I’ll show you how it works later in this post. For now, just define the custom commands in your commands.cfg file as shown below.

Note: Change “-C public” to match your SNMP community string. Also use the path to your own plugin directory, which may be different from mine.

## Poll a device using the OID specified as $ARG1$ and apply options specified in $ARG2$ and $ARG3$

define command{
command_name    check_snmp_router
command_line    /usr/lib/nagios/plugins/check_snmp -H $HOSTADDRESS$ -C public -o $ARG1$ $ARG2$ $ARG3$
}

## Call the check_bgp.pl perl script and send the IP Address of the BGP Peer specified in $ARG1$ by using the -p option

define command{
command_name    check_cisco_bgp
command_line    /usr/lib/nagios/plugins/check_bgp.pl -H $HOSTADDRESS$ -C public -p $ARG1$
}

## Telnet to port 22 for each host, expect (-e) to see "SSH" somewhere in the output, then quit (-q) by sending the string "exit"

define command {
command_name     check_cisco_ssh
command_line     /usr/lib/nagios/plugins/check_tcp -H $HOSTADDRESS$ -p 22 -e SSH -q exit
} 
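To make the macro substitution less abstract, here is a rough Python sketch of how I picture Nagios turning a check_command value into a plugin command line. The function name expand_check_command is my own invention, and the logic is a simplification (split on “!”, fill in $ARG1$ through $ARG3$), not a copy of what Nagios does internally.

```python
def expand_check_command(command_line, check_command, host_address):
    """Substitute $HOSTADDRESS$ and $ARG1$..$ARG3$ into a command_line string."""
    # Everything before the first "!" is the command name; the rest are arguments.
    name, *args = check_command.split("!")
    expanded = command_line.replace("$HOSTADDRESS$", host_address)
    for position in (1, 2, 3):
        # In this sketch a missing argument simply expands to an empty string.
        value = args[position - 1] if position <= len(args) else ""
        expanded = expanded.replace(f"$ARG{position}$", value)
    return name, expanded.rstrip()


COMMAND_LINE = ("/usr/lib/nagios/plugins/check_snmp -H $HOSTADDRESS$ "
                "-C public -o $ARG1$ $ARG2$ $ARG3$")

name, line = expand_check_command(
    COMMAND_LINE,
    "check_snmp_router!.1.3.6.1.2.1.2.2.1.8.1!-r 1!-l ifOperStatus",
    "10.0.0.1",
)
print(name)
print(line)
```

The second line printed is what would be handed to the shell, which is a handy thing to paste into a terminal when a check misbehaves.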

3. (Optional) Create hostgroups.cfg

Decide how you want to logically group your devices and then add the definitions to a file named hostgroups.cfg that you create in your /etc/nagios3/conf.d directory.

Nagios will display your host groups in alphabetical order so if you want to influence how things are displayed you can just include numerals in their names.

 define hostgroup {
hostgroup_name                Routers
}

define hostgroup {
hostgroup_name                Switches
}

define hostgroup {
hostgroup_name                Firewalls
}

define hostgroup {
hostgroup_name                VPN
} 
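If you want to try the numeral trick, a small sketch like the one below shows the effect. It assumes the display order matches a plain string sort, and the “1.Routers” naming style and the helper numbered_hostgroups are only illustrations, so adjust them to taste.

```python
def numbered_hostgroups(names):
    """Prefix each name with its 1-based position, e.g. '1.Routers'."""
    return [f"{position}.{name}" for position, name in enumerate(names, start=1)]


# The order I would like to see on screen (hypothetical preference).
desired_order = ["Routers", "Switches", "Firewalls", "VPN"]
numbered = numbered_hostgroups(desired_order)

print(sorted(desired_order))  # plain names sort alphabetically
print(sorted(numbered))       # prefixed names keep the desired order
```

With ten or more groups you would want zero-padded prefixes (01, 02, ...), since “10.” sorts before “2.” as a string.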

4. Create hosts.cfg

Create a file called hosts.cfg in /etc/nagios3/conf.d and define your hosts in it, optionally assigning them to the host groups you created in Step 3.

In this example I am using the default template “generic-host”. You’ll want to develop a standard template of your own once you are more comfortable with Nagios.

define host {
host_name       core1
alias           core1.domain.com
address         10.0.0.1
use             generic-host
hostgroups      Routers
}

define host {
host_name       fw1
alias           fw1.domain.com
address         10.0.0.254
use             generic-host
hostgroups      Firewalls
}
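Once you have more than a handful of devices, typing host blocks by hand gets old quickly. Here is a minimal Python sketch that renders the same host definition format from a list of tuples. The inventory list, the render_hosts helper, and the domain.com default are placeholders of mine, so treat it as a starting point and not a finished tool.

```python
HOST_TEMPLATE = """define host {{
host_name       {name}
alias           {name}.{domain}
address         {address}
use             generic-host
hostgroups      {hostgroup}
}}
"""


def render_hosts(inventory, domain="domain.com"):
    """Render one 'define host' block per (name, address, hostgroup) tuple."""
    return "\n".join(
        HOST_TEMPLATE.format(name=name, domain=domain,
                             address=address, hostgroup=hostgroup)
        for name, address, hostgroup in inventory
    )


# Hypothetical inventory; replace with your own devices.
inventory = [
    ("core1", "10.0.0.1", "Routers"),
    ("fw1", "10.0.0.254", "Firewalls"),
]

hosts_cfg = render_hosts(inventory)
print(hosts_cfg)
```

Redirect the output into hosts.cfg, or write it with open() once you are happy with what it prints.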

5. Define Services to Monitor

Here are the service definitions to place in your services.cfg file, complete with full OIDs. Just copy, paste, and modify (if necessary).

I will be using the generic-service template for each of these. Replace any OIDs that end with an “X” with the proper unique SNMP identifier from your device. These identifiers can be found with snmpwalk commands.

Note: My code snippets are line wrapping – be sure to include everything on one line in your configuration file.

For Use with All Cisco Devices

; Report the System Uptime
define service {
use                     generic-service
hosts                   *
service_description     System UpTime
check_interval          5    ; This overrides the value inherited from the generic-service template
check_command           check_snmp_router!1.3.6.1.2.1.1.3.0
}

; Check latency and packet loss - specify the warning and critical levels for each
define service {
use                     generic-service
hosts                    *
service_description     PING
check_interval          5
check_command           check_ping!200.0,20%!400.0,40%   ; warning and critical levels for latency, packet loss%
}

; Verify that an SSH connection can be established
define service {
use                     generic-service
hosts                    *
service_description     SSH
check_interval          5
check_command           check_cisco_ssh
}
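The two PING arguments are easier to reason about with a small model. The sketch below is my understanding of how the “200.0,20%” (warning) and “400.0,40%” (critical) pairs are evaluated: either the round trip average or the packet loss reaching its threshold is enough to change state. The function names are mine, and I am assuming the comparison is inclusive.

```python
def parse_threshold(text):
    """Split a pair such as '200.0,20%' into (rta_ms, packet_loss_percent)."""
    rta, loss = text.split(",")
    return float(rta), float(loss.rstrip("%"))


def ping_state(rta_ms, loss_pct, warning, critical):
    """Classify a ping result against warning and critical threshold pairs."""
    warn_rta, warn_loss = parse_threshold(warning)
    crit_rta, crit_loss = parse_threshold(critical)
    # Critical is checked first so it wins when both levels are exceeded.
    if rta_ms >= crit_rta or loss_pct >= crit_loss:
        return "CRITICAL"
    if rta_ms >= warn_rta or loss_pct >= warn_loss:
        return "WARNING"
    return "OK"


for rta, loss in [(35.0, 0), (250.0, 0), (35.0, 20), (35.0, 60)]:
    print(rta, loss, ping_state(rta, loss, "200.0,20%", "400.0,40%"))
```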

For Use with Cisco Nexus Devices Only

; Cisco Nexus CPU Avg
define service {
use                     generic-service
hostgroup               Routers
service_description     5 Min CPU Average
check_interval          5
check_command           check_snmp_router!.1.3.6.1.4.1.9.9.109.1.1.1.1.5.1!-l \"5 Minute CPU \% \" -w 50 -c 80 ; -w is warning level, -c is critical
} 
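For the CPU check, here is a tiny sketch of how I read “-w 50 -c 80”. It assumes the plain-number form of the standard plugin threshold syntax, where a bare number N means “alert when the value is below 0 or above N”. Older check_snmp builds may behave differently, so test against your own version. The cpu_state helper is mine.

```python
def cpu_state(value, warn=50, crit=80):
    """Classify a CPU percentage against bare-number warning/critical limits."""
    # A bare limit N is treated as the range 0..N; values outside it alert.
    if value < 0 or value > crit:
        return "CRITICAL"
    if value > warn:
        return "WARNING"
    return "OK"


for sample in (12, 50, 51, 80, 81):
    print(sample, cpu_state(sample))
```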

For Use with Cisco Routers and Switches

; Cisco IOS CPU Avg
define service {
use                     generic-service
hostgroup               Routers,Switches,Firewalls,VPN
service_description     5 Min CPU Average
check_interval          5
check_command           check_snmp_router!.1.3.6.1.4.1.9.9.109.1.1.1.1.5.1!-l \"5 Minute CPU \% \" -w 50 -c 80
servicegroups           Memory_and_CPU
}

; Monitor BGP peer sessions to ISPs
define service {
use                     generic-service
hosts                   core1
service_description     BGP Session: ISP 1
check_interval          5
check_command           check_cisco_bgp!x.x.x.x  ; Insert your BGP peer address here
}

; Monitor the IP SLA ICMP Echo Round Trip Time
define service {
use                     generic-service
hosts                   core1
service_description     IP SLA RTT for ISP 1
check_interval          1
check_command           check_snmp_router!.1.3.6.1.4.1.9.9.42.1.2.10.1.1.X!-l "Last RTT (ms)" -w 1000 -c 2000 ; where X is your IP SLA operation number
}

; Verify our IP SLA ICMP Echo command was successful
define service {
use                     generic-service
hosts                   core1
service_description     IP SLA PING Success for ISP 1
check_interval          1
check_command           check_snmp_router!.1.3.6.1.4.1.9.9.42.1.2.10.1.2.X!-r 1!-l "IP SLA Ping Success" ; where X is your IP SLA operation number
}
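One thing worth knowing about “-r 1”: as far as I can tell the pattern is treated as an unanchored regular expression, so it matches any returned value that contains a 1. The Python sketch below only models that matching behavior with re.search. The sample value 10 is hypothetical and the anchored pattern is shown for comparison, not as a tested Nagios setting.

```python
import re


def regex_check(pattern, returned_value):
    """Return OK when the pattern is found anywhere in the returned value."""
    return "OK" if re.search(pattern, returned_value) else "CRITICAL"


for value in ["1", "2", "10"]:
    # Unanchored "1" versus an anchored "^1$" for comparison.
    print(value, regex_check("1", value), regex_check("^1$", value))
```

If a device can return a multi-digit value containing a 1, the unanchored pattern would still report OK, so it is worth checking what your devices actually return.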

For Use with Cisco ASAs and FWSMs

 ; Total Sessions
define service {
use                     generic-service
hostgroup               Firewalls,VPN
service_description     Total_Sessions
check_interval          5
check_command           check_snmp_router!.1.3.6.1.4.1.9.9.147.1.2.2.2.1.5.40.6!-l \"Total Current Sessions\" -w 20000 -c 30000
servicegroups           FW_and_VPN
} 

6. Define Interfaces to Monitor

First, identify which interfaces you want to monitor, then run an snmpwalk of the mib-2.interfaces OID to find out how SNMP identifies them. Here’s an example:

 root@nag001:/etc/nagios3/conf.d# snmpwalk -v2c -c public 10.0.0.1 mib-2.interfaces
.
IF-MIB::ifDescr.1 = STRING: GigabitEthernet0/0
IF-MIB::ifDescr.2 = STRING: GigabitEthernet0/1
.
IF-MIB::ifDescr.5 = STRING: GigabitEthernet0/0.2
IF-MIB::ifDescr.7 = STRING: GigabitEthernet0/1.10
IF-MIB::ifDescr.8 = STRING: GigabitEthernet0/1.11
IF-MIB::ifDescr.9 = STRING: GigabitEthernet0/1.12

The numeral after “IF-MIB::ifDescr.” is what identifies each of your interfaces. So “1” is G0/0, “2” is G0/1, and so on. Now you can use this information to monitor your interfaces by using the following service template:

 define service {
use                     generic-service
hosts                   your-host-name
service_description     your-interface-description
check_command           check_snmp_router!.1.3.6.1.2.1.2.2.1.8.X!-r 1!-l ifOperStatus
} 
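If you would rather not read the index numbers off the snmpwalk output by eye, something like this Python sketch can pull them out for you. It assumes the ifDescr lines look exactly like the sample output above. The regular expression and the parse_ifdescr helper are my own, not part of any Nagios or Net-SNMP tooling.

```python
import re

IF_OPER_STATUS = ".1.3.6.1.2.1.2.2.1.8"
IFDESCR_LINE = re.compile(r"^IF-MIB::ifDescr\.(\d+) = STRING: (.+)$")


def parse_ifdescr(walk_output):
    """Map each SNMP interface index to its description."""
    interfaces = {}
    for line in walk_output.splitlines():
        match = IFDESCR_LINE.match(line.strip())
        if match:  # silently skip anything that is not an ifDescr line
            interfaces[int(match.group(1))] = match.group(2).strip()
    return interfaces


sample = """\
IF-MIB::ifDescr.1 = STRING: GigabitEthernet0/0
IF-MIB::ifDescr.2 = STRING: GigabitEthernet0/1
IF-MIB::ifDescr.5 = STRING: GigabitEthernet0/0.2
"""

for index, name in parse_ifdescr(sample).items():
    # The OID to poll is the ifOperStatus base plus the interface index.
    print(f"{name}: {IF_OPER_STATUS}.{index}")
```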

So using the example above, we would have the following directives for monitoring whether our interfaces are UP or DOWN (note that the X in each OID has been replaced):

 define service {
use                     generic-service
hosts                   core1
service_description     GigabitEthernet0/0
check_command           check_snmp_router!.1.3.6.1.2.1.2.2.1.8.1!-r 1!-l ifOperStatus
}

define service {
use                     generic-service
hosts                   core1
service_description     GigabitEthernet0/1
check_command           check_snmp_router!.1.3.6.1.2.1.2.2.1.8.2!-r 1!-l ifOperStatus
}

define service {
use                     generic-service
hosts                   core1
service_description     GigabitEthernet0/0.2
check_command           check_snmp_router!.1.3.6.1.2.1.2.2.1.8.5!-r 1!-l ifOperStatus
}

define service {
use                     generic-service
hosts                   core1
service_description     GigabitEthernet0/1.10
check_command           check_snmp_router!.1.3.6.1.2.1.2.2.1.8.7!-r 1!-l ifOperStatus
}

define service {
use                     generic-service
hosts                   core1
service_description     GigabitEthernet0/1.11
check_command           check_snmp_router!.1.3.6.1.2.1.2.2.1.8.8!-r 1!-l ifOperStatus
}

define service {
use                     generic-service
hosts                   core1
service_description     GigabitEthernet0/1.12
check_command           check_snmp_router!.1.3.6.1.2.1.2.2.1.8.9!-r 1!-l ifOperStatus
} 

Those are all the pieces you need for monitoring your network. Now it’s just a matter of adding hosts, their associated services, and restarting Nagios!

Miscellaneous: Using Awk to Create Service Definitions

Configuring Nagios can be a daunting task if you have hundreds of interfaces on numerous devices to monitor. To help make the process easier, you can (and should) use either a Python or shell script to add your interfaces.

If you don’t have any scripting experience, here is a simple procedure using awk to help make the process easier.

First, save the ifDescr lines from your snmpwalk to a local text file called interfaces.txt. Walking only IF-MIB::ifDescr keeps the other columns of the interface table out of the file:

 root@nag001:/etc/nagios3/conf.d# snmpwalk -v2c -c public 10.0.0.1 IF-MIB::ifDescr >> interfaces.txt

Next, replace all instances of “IF-MIB::ifDescr” with “.1.3.6.1.2.1.2.2.1.8” using your text editor of choice.

Your file should now look something like this:

.1.3.6.1.2.1.2.2.1.8.67 = STRING: GigabitEthernet9/11
.1.3.6.1.2.1.2.2.1.8.68 = STRING: GigabitEthernet9/12
.1.3.6.1.2.1.2.2.1.8.252 = STRING: TenGigabitEthernet7/1
.1.3.6.1.2.1.2.2.1.8.253 = STRING: TenGigabitEthernet7/2
.1.3.6.1.2.1.2.2.1.8.91 = STRING: Port-channel1
.1.3.6.1.2.1.2.2.1.8.92 = STRING: Port-channel2
.1.3.6.1.2.1.2.2.1.8.94 = STRING: Port-channel3
.1.3.6.1.2.1.2.2.1.8.95 = STRING: Port-channel4
.1.3.6.1.2.1.2.2.1.8.96 = STRING: Port-channel5
.1.3.6.1.2.1.2.2.1.8.97 = STRING: Port-channel6
.1.3.6.1.2.1.2.2.1.8.98 = STRING: Port-channel7
.1.3.6.1.2.1.2.2.1.8.99 = STRING: Port-channel8
.1.3.6.1.2.1.2.2.1.8.100 = STRING: Port-channel9 

Now you can run the following awk command to create and format the service definitions for your host and add them to the services.cfg file.

awk ' {print "define service \
{ \n \t use \t \t \t generic-service \n \
\t hosts \t \t \t core1 \n \
\t service_description \t "$4" \n \
\t check_command \t \t check_snmp_router!"$1"!-r 1!-l ifOperStatus \n \
\t } \n"}' interfaces.txt >> /etc/nagios3/conf.d/services.cfg
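For anyone who prefers Python over awk, here is a rough equivalent. Like the awk command, it takes the first field as the OID and the fourth as the interface name. The optional prefixes filter and the render_services helper are additions of mine, and the sample lines are copied from the file contents shown earlier.

```python
SERVICE = """define service {{
\tuse\t\t\tgeneric-service
\thosts\t\t\t{host}
\tservice_description\t{name}
\tcheck_command\t\tcheck_snmp_router!{oid}!-r 1!-l ifOperStatus
}}
"""


def render_services(lines, host, prefixes=()):
    """Build one service definition per 'OID = STRING: name' line."""
    blocks = []
    for line in lines:
        fields = line.split()
        if len(fields) < 4:
            continue  # skip blank or malformed lines
        oid, name = fields[0], fields[3]  # same fields as $1 and $4 in awk
        if prefixes and not name.startswith(tuple(prefixes)):
            continue  # optional filter, e.g. only Port-channel interfaces
        blocks.append(SERVICE.format(host=host, name=name, oid=oid))
    return "\n".join(blocks)


sample = [
    ".1.3.6.1.2.1.2.2.1.8.67 = STRING: GigabitEthernet9/11",
    ".1.3.6.1.2.1.2.2.1.8.91 = STRING: Port-channel1",
    ".1.3.6.1.2.1.2.2.1.8.92 = STRING: Port-channel2",
]

services = render_services(sample, "core1", prefixes=("Port-channel",))
print(services)
```

To work from the real file, pass open("interfaces.txt") as the first argument and append the result to services.cfg once the printed output looks right.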
