
Visualization of historical Nagios data

At the end of this semester's "Data and Information Visualization" class, we had to form groups, pick some interesting data and build a fancy visualization out of it. Most of the teams decided to visualize some kind of geospatial data from the field of data visualization. The Fukushima incident, pandemics and earthquakes in general were very popular. My teammate and I decided to pick a topic from the field of information visualization instead. As he works as a network administrator for our university, we had the chance to get hold of the university's Nagios log files from the last three years.

Data

As mentioned in the previous section, the data consisted of Nagios logs, starting at the end of 2009 and running until March 2012. In total they cover 810 days and 956 individual hosts. Each Nagios log file consists of a large number of host states separated by line breaks, with the following (relevant) fields:

Field      Meaning                 Characteristics
date       (in epoch time)         quantitative
service    what was measured       nominal
hostname   where it was measured   nominal
state      OK or error             ordinal

In order to reduce the data to a meaningful subset, we decided to use only three different service checks to indicate the health status of a single host and of the network:

  1. PING - indicates if the host is online
  2. SSH - checks if local superuser login is possible (host is in the correct boot state)
  3. GSSD - indicates if users are able to log in and retrieve their roaming profile

After this we had to handle the preprocessing, from log files to a more query-friendly storage (MySQL). For that I hacked some Ruby code into a sophisticated log converter, and you can trust me, this part was really f**** up. Nagios logs are a mess ... you won't believe it. But finally we had about 240,000 time spans in our DB, each one representing a single fail state.
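To give an idea of what such a conversion involves, here is a minimal Python sketch (not the original Ruby converter). It assumes log lines of the form `[epoch] SERVICE ALERT: host;service;STATE;...`; the names `LINE_RE`, `FailSpan` and `fail_spans` are made up for illustration, and the real logs needed far more special-casing than this.

```python
import re
from collections import namedtuple

# Assumed line format (an illustration, not taken from the original converter):
#   [epoch] SERVICE ALERT: host;service;STATE;...
LINE_RE = re.compile(r"^\[(\d+)\] SERVICE ALERT: ([^;]+);([^;]+);([^;]+);")
SERVICES = {"PING", "SSH", "GSSD"}

FailSpan = namedtuple("FailSpan", "host service start end")


def fail_spans(lines):
    """Turn a chronological stream of log lines into closed fail spans."""
    open_since = {}  # (host, service) -> epoch of the first non-OK state
    spans = []
    for line in lines:
        match = LINE_RE.match(line)
        if not match:
            continue  # skip everything that is not a service alert
        epoch, host, service, state = match.groups()
        if service not in SERVICES:
            continue
        key = (host, service)
        if state != "OK":
            # Keep the earliest failure time if the check is already failing.
            open_since.setdefault(key, int(epoch))
        elif key in open_since:
            # Recovery: close the span that was opened by the first failure.
            spans.append(FailSpan(host, service, open_since.pop(key), int(epoch)))
    # Failures that never recover within the input are not emitted here.
    return spans
```

Each returned tuple corresponds to one row of the kind described above: a host, a service, and the start and end of a single fail state.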

User

The user modeling was kind of the easiest part of this project. We decided to develop our visualization for the one user who already uses Nagios: the network administrator. This is directly related to the underlying problems of Nagios and to the visualization goals, as you will see in the next section. Later on, once we had decided on the first visualization technique, we identified a secondary user. As you will see, the chosen visualization can also help management/human resources to identify weak spots, schedule holidays and hire additional staff.

Goals

The main problem with Nagios is that a user can only view a snapshot of the current state of the network. But as everyone who has ever worked in this field knows, operating a network also requires a historical view of the data. For example, recurring problems with a single server may lead to problems on a large number of hosts. Therefore the operator should be able to identify local problems in a global, historical context.

  1. global: overall health status over a long period (e.g. one year), supporting the identification of recurring problems
  2. local: detailed long-term development of the health status of single hosts in a selected time period

Many of these problems occur due to temporal effects:

  • seasonal effects 
    • e.g. winter, summer - problems with air conditioning 
    • holidays
  • day of week 
    • e.g. weekdays vs. weekends 
    • patchdays (e.g. Microsoft, every 2nd Tuesday of the month)
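The temporal effects listed above can be derived from a plain date. The following Python sketch is only an illustration: the function names are hypothetical, and the season boundaries (December to February as winter, June to August as summer) are an assumption for the northern hemisphere.

```python
from datetime import date, timedelta


def second_tuesday(year, month):
    """Return the date of the second Tuesday of the given month."""
    first = date(year, month, 1)
    # weekday(): Monday == 0, Tuesday == 1
    days_to_tuesday = (1 - first.weekday()) % 7
    return first + timedelta(days=days_to_tuesday + 7)


def temporal_tags(day):
    """Label a day with the temporal effects it falls under."""
    tags = ["weekend" if day.weekday() >= 5 else "weekday"]
    # Assumption: northern hemisphere, seasons cut at month boundaries.
    if day.month in (12, 1, 2):
        tags.append("winter")
    elif day.month in (6, 7, 8):
        tags.append("summer")
    if day == second_tuesday(day.year, day.month):
        tags.append("patchday")
    return tags
```

For example, `temporal_tags(date(2012, 3, 13))` tags that day as a weekday and a patchday.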

This resulted in the first sub-goal for our visualization: we needed a good overview of the health of the network as an entry point for a more detailed analysis. At the same time this satisfies the first point of the Visual Information Seeking Mantra, "Overview first, zoom and filter, then details-on-demand". To address these issues we decided to implement a technique proposed by Rick Wicklin and Robert Allison from the SAS Institute. You can find a poster of their idea here. They used it to analyse flight patterns in US airline traffic.
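The data preparation behind such a calendar overview can be sketched in a few lines of Python. This is a generic calendar heat map layout (one column per week, one row per weekday) and not necessarily the exact arrangement from the poster or from our implementation. The fail spans are assumed to be `(start, end)` tuples in epoch seconds, days are cut in UTC, and both function names are made up.

```python
from collections import Counter
from datetime import date, datetime, timedelta, timezone


def failures_per_day(spans):
    """Count, for each calendar day (UTC), how many fail spans touch it."""
    counts = Counter()
    for start, end in spans:
        day = datetime.fromtimestamp(start, tz=timezone.utc).date()
        last = datetime.fromtimestamp(end, tz=timezone.utc).date()
        while day <= last:
            counts[day] += 1
            day += timedelta(days=1)
    return counts


def calendar_cell(day):
    """Map a date to (column, row): column = week within the year, row = weekday."""
    jan1 = date(day.year, 1, 1)
    # Shift by the weekday of January 1st so that every column starts on a Monday.
    column = (day.timetuple().tm_yday - 1 + jan1.weekday()) // 7
    return column, day.weekday()
```

The count per day would then be mapped to a color scale and drawn at the cell returned by `calendar_cell`.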

With two or three days left for the implementation phase of our own solution, I came up with this:

To support the user in detailed analysis, two more visualizations were developed. These rely on traditional line graphs as known from similar tools. The final visualization looks like the following screenshot and allows the user to proceed as follows:

  1. The user selects the year to work with. The program fetches the data from the database and renders it. 
  2. After selecting a year from the drop-down menu, the user can select a single day from the calendar for further analysis. The current selection is shown in 2b, the selected cell is highlighted and the detailed visualizations are shown. 
  3. The UI presents a list of the top 10 problematic hosts on that day (permanently offline hosts are filtered out!). The user can then choose one of these hosts and is presented with a line chart showing the long-term development of the selected host. 
  4. Selecting a specific day in the line chart selects that day, and all visualizations are updated accordingly.
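The ranking in step 3 could be computed roughly as in the following Python sketch. It makes several assumptions that go beyond what is described above: the spans are `(host, start, end)` tuples in epoch seconds for a single service such as PING, hosts are ranked by failed seconds within the day, and a host that failed for the whole day counts as permanently offline. The function name is hypothetical.

```python
from collections import defaultdict

DAY = 86400  # seconds per day


def top_hosts(spans, day_start, limit=10):
    """Rank hosts by failed seconds within [day_start, day_start + DAY)."""
    day_end = day_start + DAY
    downtime = defaultdict(int)
    for host, start, end in spans:
        # Only the part of the span that lies inside the selected day counts.
        overlap = min(end, day_end) - max(start, day_start)
        if overlap > 0:
            downtime[host] += overlap
    # Assumption: a host that failed for the entire day is treated as
    # permanently offline and hidden from the ranking.
    ranked = [(host, secs) for host, secs in downtime.items() if secs < DAY]
    # Longest downtime first, host name as tie-breaker.
    ranked.sort(key=lambda item: (-item[1], item[0]))
    return ranked[:limit]
```

Clipping each span to the selected day keeps long outages that started earlier from dominating the list with downtime that belongs to other days.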


Conclusion 

Does it represent the underlying phenomena?

In the operation of a network with a large number of computers, recurring problems are very common. These may be seasonal effects, such as problems with cooling in the summer, or patch days conducted weekly or monthly. Various factors, such as network overload or the unavailability of central servers (DNS, file server, ...), can also lead to subsequent errors on the clients. Thanks to its arrangement and the chosen color scale, the visualization chosen for the annual overview supports the detection of recurring events significantly better than other software, which usually uses line charts. In contrast, line charts were used in the daily overview and in the visualization of the long-term development of a single host. Here the separation of individual services and the corresponding time-dependent, detailed information is essential for a meaningful analysis by the user.

Does it help the user to understand the phenomena?

As already described, after analysing the underlying data we identified a primary and a secondary user of the visualization. As a technical operator, the primary user knows the specific characteristics of the network and the associated hosts. On the global level the visualization allows him to quickly identify problematic time periods, and his specific knowledge together with the other two visualizations enables him to narrow problems down to individual hosts. The data obtained in this way can then be used to investigate a single day more closely and to recognize correlations at the host level. Based on these findings the network operator can make improvements to the network. The chosen workflow from a global perspective to a local one thus provides an intuitive way of exploring the given data and locating problems. For the secondary user, only the first visualization may be relevant, because of his lack of technical insight. However, it may be extremely helpful in the planning of human resources if the network is subject to seasonal fluctuations.