Here are just a few words about physical infrastructure monitoring, which relates to one component of the CIA triad: availability. It is neither new nor surprising that IT staff try to make their jobs easier by applying some degree of automation. Reviewing logs manually or checking all servers one by one can be a tiring and time-consuming task. Before making any statements, let us have a look at why monitoring is so important.
Why is it important to monitor physical infrastructure? Or, in other words, why is availability monitoring so crucial? First, and most prosaically, because administrators want to know that their servers are working and performing their tasks. In fact this is the most important argument, and the most difficult one. Secondly, to check server resiliency: what the system load is, what the performance is, and how the servers can be used to increase efficiency. This also touches on virtualization and costs, which are additional arguments in their own right. One may say that compliance and regulations are arguments for availability monitoring too: nobody wants their data to be lost due to downtime on machines (software or hardware). Then there is the security aspect: checking how the system behaves and looking for anomalies may reveal a DoS attack or IT sabotage in progress. The last argument is testing: it is sometimes useful to have an estimate of how a particular appliance or piece of software performs, whether it is resource-demanding, and so on.
Why is it important to monitor physical infrastructure?
- server resiliency (load distribution, load balancers)
- virtualization (still a performance issue)
- security (anomalies, downtime)
- regulations (data loss)
- costs
- development
There are of course tons of good software tools (open-source and commercial) designed to monitor different metrics on our machines, such as load average, CPU usage, disk capacity, inbound and outbound traffic on multiple interfaces, and others. Many tools use round-robin databases (and additionally SNMP, which is simple to configure), which is a great option when there is not enough time for maintenance: such applications just work. It is of course possible to use more sophisticated scripts (using ssh for communication) which can report what is going on with particular processes on the remote server. And here an interesting discussion starts. How far should we push administrators on monitoring? Should we monitor every running process on the servers, or only the most important ones?
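To make the "only the most important ones" option concrete, here is a minimal sketch of such an ssh-based script. Everything specific in it is my assumption and not a recommendation: the names in CRITICAL are placeholders, the helper names are made up for this post, and remote_process_names assumes key-based ssh login and a remote `ps` that accepts `-e -o comm=`.

```python
import subprocess

# Hypothetical list of processes we consider critical; replace with your own.
CRITICAL = {"sshd", "cron"}


def missing_processes(ps_output: str, required: set) -> set:
    """Return the required process names that do not appear in the ps output.

    `ps_output` is expected to contain one command name per line.
    """
    running = {line.strip() for line in ps_output.splitlines() if line.strip()}
    return set(required) - running


def remote_process_names(host: str) -> str:
    """Fetch the command names of all processes on `host` over ssh.

    Assumes passwordless (key-based) login; raises if ssh or ps fails.
    """
    result = subprocess.run(
        ["ssh", host, "ps", "-e", "-o", "comm="],
        capture_output=True,
        text=True,
        check=True,
        timeout=30,
    )
    return result.stdout
```

A cron job could call missing_processes(remote_process_names(host), CRITICAL) for every monitored host and raise a warning whenever the result is not empty. Keeping the parsing separate from the ssh call makes the check easy to test without a remote machine.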
The workflow is simple. We 'add' remote machines to our monitoring server and present performance and other metrics as simple (aggregated!) graphs, with colors or even sound warnings. Viewing tables of numbers is the worst option, but it can work too, perhaps with additional visualizations or meaningful colors (red/green?). It would be beneficial to add threshold alerts to such a solution (moving average, moving average with dynamic deviation) and to configure emails for notice, warning and alert levels. In my humble opinion, this is still not enough.
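As an illustration of the "moving average with dynamic deviation" idea, here is a small sketch. It is one possible reading of the technique, not how any particular tool implements it: the class name, the window size, the multiplier k and the min_dev floor are all assumptions made for this example. A sample triggers an alert when it exceeds the mean of the previous window by more than k standard deviations.

```python
from collections import deque
from statistics import mean, pstdev


class MovingAverageAlert:
    """Flag samples that rise above moving average + k * moving deviation."""

    def __init__(self, window=10, k=3.0, min_dev=0.0):
        self.window = window    # number of past samples to compare against
        self.k = k              # how many deviations count as abnormal
        self.min_dev = min_dev  # floor for the deviation on very flat series
        self.samples = deque(maxlen=window)

    def check(self, value):
        """Return True if `value` is abnormal, then add it to the window."""
        alert = False
        # Only judge once a full window of history has been collected.
        if len(self.samples) == self.window:
            avg = mean(self.samples)
            dev = max(pstdev(self.samples), self.min_dev)
            alert = value > avg + self.k * dev
        self.samples.append(value)
        return alert
```

Feeding it load-average samples one by one, for example with MovingAverageAlert(window=5), returns False while the window is filling and for values close to the recent average, and True for a sudden spike. Separate notice, warning and alert levels could be built by running several instances with different k values.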
I would like to take one step back here. Imagine that you have a really important process running on your remote machines. This process is responsible for collecting data or for anything else critical to your service. Due to performance issues, or an unexpected bug in the application (such bugs tend to surface only when the system is overloaded and the 'catch-up' behavior can be spotted), the 'really important' process stopped, and your graphs do not show the problem! The answer in this case is simple to state: why was the machine overloaded before, and why did nobody fix it? What is the point of 'monitoring the monitoring tool' when nobody looks at or cares about the warnings? The human factor is crucial, and even the best availability monitoring tools cannot replace a person.
A combination of dashboards, a warning system, daily checks and human awareness is the best option when dealing with infrastructure monitoring.