Alarm! Alarm! There’s been a problem!

Incident management is never easy, and it varies greatly depending on the type of incident and the personnel available when the problem arises. We all know how incident management groups are supposed to be formed, and the theory is all very good, but it fails 90% of the time. The group is almost always made up of whoever happens to be available when the problem occurs. Detecting a problem at 4 a.m. on a Sunday is very different from detecting it on a Tuesday at noon, when the whole staff is at work. Best of all is that passer-by who ends up standing shoulder to shoulder with the rest of the team despite having no involvement at all.

Now that we have our group, it’s time to solve the problem. An incident in a control system cannot be addressed the same way as one in the corporate network. In the latter, the most important thing is to solve the problem before it can affect the whole company, so isolating the problem is essential. How? Easy: we disconnect the infected machines from the rest of the network. In the control network, however, availability is the first priority. We cannot just disconnect a machine and go home, because if that computer happens to control the reactor – let’s assume we’re dealing with a problem in a nuclear plant – you end up with a new Chernobyl. The equipment must stay on and working for as long as possible, at least until there’s time to stop the systems that depend on the infected computers. Time management, preparation before the incident, backup copies, etc., are therefore extremely important here to minimize the damage.

The contingency plan must begin at installation/implementation time. It might seem tedious, and it will probably never be used, but it is important to document and record the steps taken when setting up the equipment (configuring the operating system, installing software, importing applications, etc.) and to measure the time each task requires. This way, we’ll always know how long it takes to recover the system from a complete disaster – provided we have the appropriate hardware. Something as trivial as a backup copy is often skipped, or not correctly documented. It is very important to keep track of backup copies and their change history: sometimes the problem can be solved simply by restoring a previous state of the system.
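As a minimal illustration of the backup bookkeeping described above, the following Python sketch records each copy with a timestamp and a checksum, so that a change history exists when the incident arrives. All file names and paths here are hypothetical examples, not part of any real product.

```python
#!/usr/bin/env python3
"""Minimal backup-manifest tracker (illustrative sketch only)."""
import hashlib
import json
import os
from datetime import datetime, timezone

MANIFEST = "backup_manifest.json"  # hypothetical manifest location


def sha256_of(path: str) -> str:
    """Return the SHA-256 hex digest of a file, read in chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(65536), b""):
            h.update(chunk)
    return h.hexdigest()


def register_backup(backup_path: str, note: str = "") -> dict:
    """Append an entry for backup_path to the manifest and return it."""
    entry = {
        "file": backup_path,
        "sha256": sha256_of(backup_path),
        "taken_at": datetime.now(timezone.utc).isoformat(),
        "note": note,
    }
    history = []
    if os.path.exists(MANIFEST):
        with open(MANIFEST) as f:
            history = json.load(f)
    history.append(entry)
    with open(MANIFEST, "w") as f:
        json.dump(history, f, indent=2)
    return entry
```

With a record like this, an operator can verify that a restored copy matches the checksum taken when it was made, and can see at a glance which previous state of the system to roll back to.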

Having spare computers for critical equipment can seem useless and unnecessary (control equipment is expensive) if we assume that “our company has never had any problems”. We all know that accidents happen sooner or later, and we’d better be prepared. Returning to the example of the nuclear plant, a program failure caused by an update that turns out to be incompatible with some function can be easily solved by replacing the computer with another one that doesn’t have the update. This way, availability is not seriously compromised, and we can recover the infected or problematic computer later. This measure, combined with regular copies of the data, should be enough to avoid major problems with control systems.

What usually fails after an incident is resolved is the documentation process. Since the problem has already been solved, the team that worked on it goes back to its everyday tasks and forgets to write up what was done and which steps were taken to address the incident. If something similar happens again, we can save time if we have everything in writing to read beforehand, so the solution doesn’t depend on chance or a brilliant idea. Maybe the first time the brilliant idea worked, but that might not happen the next time, and the consequences could be disastrous.

We must remember that an incident in a control network can have grave effects on the environment, people and infrastructures, and nothing should be left to chance.

Jairo Alonso
S21sec Labs
