Saturday, 23 March 2013

Infrastructure Availability Monitoring


A few words here about physical infrastructure monitoring – in other words, about one component of the CIA triad: availability. It is neither new nor surprising that IT people try to make their jobs easier by applying, to some degree, automation. Manual log review, or checking all servers one by one, is a tiring and time-consuming task. Before making any statements, let us have a look at why monitoring is so important.

Why is it important to monitor physical infrastructure? Or, in other words, why is availability monitoring so crucial? First, the most prosaic reason: administrators want to know that their servers are up and performing their tasks. In fact this is the most important argument, and the most difficult one. Secondly, to check server resiliency: what the system load is, what the performance is, and how the servers can be used to increase efficiency. This also touches the problems of virtualization and cost, which are additional arguments in themselves. One may say that compliance and regulations are arguments for availability monitoring as well: nobody wants their data lost due to downtime, whether caused by software or hardware. There is also the security aspect: observing how the system behaves and looking for anomalies may reveal a DoS attack or IT sabotage in progress. The last reason is testing: it is sometimes nice to have an estimate of how a particular appliance or piece of software behaves, whether it is resource-demanding, and so on.


Why is it important to monitor physical infrastructure?
- server resiliency (load distribution, load balancers)
- virtualization (still a performance issue)
- security (anomalies, downtime)
- regulations (data loss)
- costs
- development

There are of course tons of good tools (open source and commercial) designed to monitor different aspects of our machines: load average, CPU, disk capacity, traffic in and out on multiple interfaces, and others. Many of them use round-robin databases (often together with SNMP, which is simple to configure), which is a great option when there is not enough time for maintenance – such applications just work. It is of course possible to use more sophisticated scripts (communicating over ssh) which can report what is going on with particular processes on a remote server. And here an interesting discussion starts. How far should we push administrators with monitoring? Should we monitor every process running on our servers, or only the most important ones?
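Such an ssh-based process check can be sketched in a few lines. This is a minimal sketch, assuming key-based ssh access; the `monitor` user, host names and process names are hypothetical, invented for illustration:

```python
import subprocess

def build_check_command(host, process_name, user="monitor"):
    # ssh to the remote host and ask pgrep for an exact process-name match;
    # pgrep exits with code 0 when at least one matching process exists
    return ["ssh", f"{user}@{host}", "pgrep", "-x", process_name]

def check_remote_process(host, process_name, user="monitor"):
    # returns True if the process is running on the remote machine
    result = subprocess.run(build_check_command(host, process_name, user),
                            capture_output=True)
    return result.returncode == 0
```

A monitoring server could loop over its machine list calling `check_remote_process` and feed the results into its alerting pipeline.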

The workflow is simple. We 'add' remote machines to our monitoring server and present performance and other factors as simple, aggregated graphs, with colours or even sound warnings. The worst idea is to view tables of raw numbers, but even that can be made workable, perhaps with additional visualizations or meaningful colours (red/green?). To such a solution it would be beneficial to add threshold alerts (moving average, or moving average with a dynamic deviation) and to configure e-mails for the notice, warning and alert levels. In my humble opinion, this is still not enough.
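The moving average with a dynamic deviation mentioned above can be sketched as follows. The window size and the deviation multiplier `k` are illustrative defaults, not values taken from any particular tool:

```python
from collections import deque
import statistics

class ThresholdAlert:
    """Flag a sample that deviates from the moving average of recent
    samples by more than k standard deviations (the 'dynamic deviation')."""

    def __init__(self, window=20, k=3.0):
        self.samples = deque(maxlen=window)  # sliding window of recent values
        self.k = k

    def check(self, value):
        # compare the new value against the statistics of the window,
        # then add it to the window for future checks
        if len(self.samples) >= 2:
            mean = statistics.fmean(self.samples)
            stdev = statistics.pstdev(self.samples)
            alert = stdev > 0 and abs(value - mean) > self.k * stdev
        else:
            alert = False  # not enough history to judge yet
        self.samples.append(value)
        return alert
```

Separate instances with different `k` values could drive the notice, warning and alert levels respectively.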


I would like to take one step back here. Imagine that you have a really important process running on your remote machines. This process is responsible for collecting data, or for anything else critical to your service. Due to performance issues, or an unexpected bug in the application (such bugs tend to reveal themselves only when the system is overloaded, and the 'catch-up' behaviour can be spotted), the 'really important' process stopped – and your graphs do not show the problem! The lesson in this case is simple: why was the machine overloaded before, and why did nobody fix it? What is the point of 'monitoring the monitoring tool' when nobody looks at the warnings or cares about them? The human factor is crucial, and even the best availability monitoring tools cannot replace a man.
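One hedge against this scenario is to monitor the process's liveness directly rather than only the host's resource graphs, for example with a heartbeat file that the critical process touches periodically. The path and the staleness threshold below are hypothetical illustration, not part of any specific tool:

```python
import time
from pathlib import Path

def heartbeat_is_fresh(path, max_age_seconds=120):
    """Return True if the heartbeat file was updated recently.

    A stale or missing heartbeat means the process has likely stopped,
    even when the host's resource graphs still look perfectly normal.
    """
    p = Path(path)
    if not p.exists():
        return False
    return (time.time() - p.stat().st_mtime) <= max_age_seconds
```

The monitoring side then only has to check file freshness and raise an alert when the heartbeat goes stale.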

A combination of dashboards, a warning system, daily checks and human awareness is the best option when dealing with infrastructure monitoring.
