DevOps Zone is brought to you in partnership with:

Patrick Debois has been working on closing the gap between development and operations for many years. In 2009 he organized the first devopsdays.org conference and since then the world is stuck with the term 'devops'. Always seeking for opportunities to optimize the global IT instead of local optimizations. Patrick is a DZone MVB and is not an employee of DZone and has posted 39 posts at DZone. You can read more from them at their website. View Full User Profile

Monitoring Over 10K URLs in Nagios

05.08.2012
| 7816 views |
  • submit to reddit

10K websites x 5 URL's to monitor

For our Atlassian Hosted Platform, we have about 10K websites we need to monitor. Those sites are monitored from a remote location to measure response time and availability. Each server would have about 5 sub URLs on average to check, resulting in 50K URL checks.

Currently we employ Nagios with check_http and require roughly about 14 Amazon Large Instances. While the nagios servers are not fully overloaded, we make sure that all checks would complete within a 5 minutes check cycle.

In a recent spike we investigated if we could do any optimizations to:

  • Use less server resources, not only to reduce costs but also avoiding the management of multiple nagios servers which don't dynamically rebalance checks across multiple nagios hosts.
  • Have all checks complete within a smaller window (say 1 minute), as this would increase our MTTD

While looking at this, we wanted the technology to be reusable with our future idea of a fully scalable and distributed monitoring in mind (think Flapjack or the new kid on the block Sensu). But for now, we wanted to focus on the checks only.

In the first blogpost of the series we look at the integration and options within Nagios. In a second blogpost we will provide proof of concept code for running an external process (ruby based) to execute and report back to nagios. Even though Nagios isn't the most fun to work with, a lot of solutions that try to replace it, focus on replacing the checks section. But Nagios gives you more the reporting, escalation, dependency management. I'm not saying there aren't solutions out there, but we consider that to be for another phase.

Check HTTP

The canonical way in Nagios to run a check is to execute Check_http.

F.i. to have it execute a check if confluence is working on https://somehost.atlassian.net/wiki , we would provide the options:

  • -H (virtual hostname), -I (ipaddress) , -p (port)
  • -u (path of url) , -S (ssl) , -f follow (follow redirects)
  • -t (timeout)

     

    $ /usr/lib64/nagios/plugins/check_http -H somehost.atlassian.net -p 443 -u /wiki -f follow -S -v -t 2 HTTP OK: HTTP/1.1 200 OK - 546 bytes in 0.734 second response time |time=0.734058s;;;0.000000 size=546B;;;0

     

Some observations:

  1. For each check configure Nagios will fork twice and exec check_http, avoiding this would improve performance as fork is considered expensive.
  2. If we were to have many URL's on the same host, we can't leverage connection reuse, making it less efficient
  3. For status checking, we can configure it to use the -J HEAD if our check doens't rely on the content of the page (saving on transfer time and reduce check time)
  4. Redirects: not an issue of Nagios, but we currently have quite a few redirects going from the login-page logic, reducing those would again improve check time.

We can reduce part of the forks by using the use_large_installation_tweaks=1 setting. The benefits and caveats are explained in the docs

Check scheduling

Nagios itself tries to be smart to schedule the checks. It tries to spread the number of service checks within the check interval you configure. More information can be found in older Nagios documentation .

Configuration options that influence the scheduling are:

  • normal_check_interval: how long between re-executing the same check
  • retry_check_interval: how fast to retry a check if it failed
  • check_period: total time for a complete check cycle
  • inter_check_delay_method: method to schedule checks (
  • service_interleave_factor: time between checks to the same remote host
  • max_concurrent_checks: obvious not ?
  • service_reaper_frequency: frequency to check for rogue checks

Default for the inter_check_delay_method is to use smart, if we want to execute the checks as fast as possible

  • n = Don't use any delay - schedule all service checks to run immediately (i.e. at the same time!)
  • d = Use a "dumb" delay of 1 second between service checks
  • s = Use a "smart" delay calculation to spread service checks out evenly (default)
  • x.xx = Use a user-supplied inter-check delay of x.xx seconds

Distributing checks

When one host can't cut it anymore, we have to scale eventually. Here are some solutions that live completely in the Nagios world:

Our future solution would have a similar approach to dispatching the checks command and gathering the results back over queue, but we'd like it to be less dependent on the Nagios solution and be possible to be integrated with other monitoring solutions (Think Unix Toolchain philosophy) A great example idea can be seen in the Velocityconf presentation Asynchronous Real-time Monitoring with Mcollective

Submitting check results back to Nagios

So with distribution we just split our problem again in smaller problems. So let's focus again on the single host running checks problem, after all, the more checks we can run on 1 host, the less we have to distribute.

Nagios Passive Checks easily allow you to uncouple the checks from your main nagios loop and submit the check results later. NSCA (Nagios Service Check Acceptor) is the most used solution for this.

NSCA does have a few limitations:

Opsview writes:

  • Only the first 511 bytes of plugin out was returned to the master, limiting the usefulness of the information you could display
  • Only the 1st line of data was returned, meaning you had to cramp output together
  • NSCA communication used fixed size packets which were inefficient
  • While results were sent, Nagios would wait for completion, introducing a bottleneck
  • If there was a communication problem with the master, results were dropped

This lead them to using NRD (Nagios Result Distributor)

Ryan Writes:

"What no one tells you when you are deploy NCSA is that it send service checks in series while nagios performs service checks in parallel"

This lead him to writing A highperformance NSCA replacement involving feeding the result direct into the livestatus pipe instead of over the NSCA protocol baked into nagios On a similar note Jelle Smet has created NSCAWEb Easily submit passive host and service checks to Nagios via external commands

We would leverage the Send NSCA Ruby Gem

Why is this relevant to our solution? Without employing some of these optimizations, our bottleneck would shift from running the checks to accepting the check results.

Another solution could be run an NRPE server , and we could probably leverage some ruby logic from Metis - a ruby NRPE server

Conclusion

Even after the following optimizations:

  • Using head vs get
  • Large installation tweaks
  • Tuning the inter_check_delay_method
  • Parallel NSCA submissions vs serial submissions

we can still optimize with:

  • Avoid the fork process by running all checks from the same process
  • Reusing the http connection across multiple requests for the same host (potentially even do http pipelining


In the next blogpost we will show the results of proof of concept code involving ruby/eventmachine/jruby and various httpclient libraries.

Published at DZone with permission of Patrick Debois, author and DZone MVB. (source)

(Note: Opinions expressed in this article and its replies are the opinions of their respective authors and not those of DZone, Inc.)

Comments

Paul Russel replied on Sun, 2012/06/10 - 9:20am

Fork is not "expensive"; it's one of the least expensive system calls out there. What it does is copy a page table entry (not even 1K) in memory, and have both processes run from there on. That's it. It's because fork is so cheap, that forkbombs (calling fork() in a loop) is so destructive; it can bring any server down in seconds, no matter its processing power.

Ujas Ujass replied on Fri, 2013/08/23 - 6:45am

A friend of mine started to use Nagios after the previous software could not handle checking so many URL`s, the Check_http. function works perfectly. He knows a lot of things about computers and suggested me to ask for help a Web Design and Hosting company that in his opinion has the best offers, I recently created an on-line shop and need to have the best design for my website.

Comment viewing options

Select your preferred way to display the comments and click "Save settings" to activate your changes.