Part 13 of 15: Business Continuity and Disaster Recovery Plans
What Is It? Disaster Recovery and Business Continuity Planning (DR/BCP) is another one on this list I think should be rated a lot higher than it is because of what it represents and how crucial it can be when it all goes sideways on your business. From an end-user perspective, there would be little resistance because it ideally happens transparently to them. Given that it takes only some initial upfront cost to establish the DR/BCP, plus ongoing maintenance costs, I often wonder why so many businesses don’t have a plan or, if they do, why it’s badly outdated and lacks the cohesion to bring it all together at crunch time.
Many have the idea that a DR/BCP is there for when things go very badly, and the existence of the business is in jeopardy. Well, it is, but it should also look after the everyday such as system restores for when a server dies, if a piece of switchgear gets replaced, or if some important files get deleted or corrupted accidentally (or on purpose for that matter). It’s often not the once-in-a-century events that represent the biggest risk but a combination of the smaller, more frequent risks that, when combined, are equally destructive.
Another false idea is that a DR/BCP is strictly technology focused. Really, the safety and well-being of your staff is first and foremost followed by the data, the technology, and the environment (the order of the last three is subjective but people always come first). There are many ways to approach this and every business is unique, but the common thread is that everyone needs DR/BCP, even at a home level for yourself. Why? Ever have a computer crash and lose all your personal data, pictures, and internet access? Did you have a backup, or did you simply resign yourself to the fact it was all gone forever? Scale that up a thousand times and imagine if your employer could not recover. In fact, the knock-on effect would be even greater as it trickles down.
I think we’ve established you need DR/BCP from a professional and personal perspective. WHEN (not if) something happens, you need to be able to pick yourself up and carry on.
Where Do I Start? Start with assessing what you have and prioritising it in order of how important it is to the business. Early on, you should establish your Recovery Time Objective (RTO) which is how soon you need to return to operation. This, of course, can be staged with milestones where some things must be restored before others and will be decided by your system and service priority. If you’re a 24/7 shop, you’ll have a different RTO than someone that operates 8 to 5, Monday to Friday. Establish your Recovery Point Objective (RPO) which determines your tolerable loss of data and operations. Of course, you’ll likely say you don’t want to lose anything but if you only back up daily, you could potentially lose up to a full day’s worth of work. This could be a few dollars or a few million, so your DR/BCP should take this into consideration. There are mitigations such as cached local copies of files and local replication between different servers, so it’s not an absolute – be sure to understand where your data and replicas reside when planning.
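To make the RPO point concrete, here is a minimal sketch that compares each system's backup interval with its RPO. The system names, intervals, and RPO values are invented for illustration, and it assumes the simplest case where the only copy of the data is the last backup (no cached local copies or replicas).

```python
from dataclasses import dataclass
from datetime import timedelta


@dataclass
class System:
    name: str
    backup_interval: timedelta  # time between successive backups
    rpo: timedelta              # maximum tolerable window of lost data


def worst_case_loss(system: System) -> timedelta:
    """Worst case: the failure hits just before the next backup runs,
    so everything since the previous backup is lost."""
    return system.backup_interval


def meets_rpo(system: System) -> bool:
    """True when the worst-case loss fits inside the agreed RPO."""
    return worst_case_loss(system) <= system.rpo


# Hypothetical inventory; replace with your own systems and targets.
systems = [
    System("file-server", backup_interval=timedelta(hours=24), rpo=timedelta(hours=4)),
    System("order-db", backup_interval=timedelta(minutes=15), rpo=timedelta(hours=1)),
]

for s in systems:
    status = "OK" if meets_rpo(s) else "GAP"
    print(f"{s.name}: worst-case loss {worst_case_loss(s)}, RPO {s.rpo} -> {status}")
```

In this made-up inventory the daily backup of the file server cannot meet a four-hour RPO, which is exactly the kind of gap worth finding on paper rather than during a recovery.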
Don’t overlook a planning exercise where you consider the odds and probabilities of events occurring. In Canada, we had to plan for extreme cold and winter storms, such as a massive ice storm that crippled large parts of Eastern Canada in the 1990s. Here in Australia, we need to concern ourselves with floods, cyclones, heatwaves, and other natural events that differ from those in the Northern Hemisphere. Around the world, there are the omnipresent issues of civil unrest, terrorism, infrastructure failures, and cybercrimes. Be sure to fully understand the threats that face your enterprise regardless of where you are.
Now that you know what systems and data you have, their priority, how soon you need to be back in operation, and how much time and data loss you can accept, you can plan the “how”. This is not an excuse to jump immediately to technology; that will come soon enough. Now you should plan who is doing what and then get them involved. Always define a role by a team rather than an individual and each team should have two or more people able to perform that role. People are single points of failure at critical times, so be sure you have a “Plan B”. If Bob can’t get in to begin the recovery, the whole mess shouldn’t come to a screeching halt. More likely, if Bob can’t make it in then Fred can. You get the idea.
Oh yes, as a bonus tip, you should generally have the people before and after a step familiar with how it works in case they need to assist. This way, each person in the chain knows what is happening before it’s time to do their part and they know what is happening after. Let’s say Sue is building a new virtual server environment and she knows that Bob will begin the restore when she is done. Bob knows what Sue is doing and can prep and help. Meanwhile, Kim knows that Bob will restore the servers in Sue’s new environment and then Kim can test the system availability. A simplified view, I know, but your DR/BCP is a linked chain of tasks; not a series of independent events.
Now that you have the people and process, sort out your environment and technology. Where will you recover? Is this a fault that simply requires an in-place restoration or is it a larger scale issue that requires the team to work remotely for a period or even indefinitely? Do you have a hot site, warm site, or cold site? Are you able to simply fail over to an alternate data centre or office, spin up an existing location and restore, or start from scratch? These parts are important to meet your RTO and RPO.
Finally, start looking at the technology. What are you concerned about and how will you back it up and recover it, including systems, data, and policies? Backup to tape? Server to server replication? Offsite media storage? Data centre to data centre replication? Cloud backup? Spare hardware? High availability of important infrastructure? Your DR/BCP should flesh this all out in detail, eliminating single points of failure and based on a fluid start-to-finish plan with intermediate goals to return your business to operation as smoothly and as quickly as possible, from restoring a deleted file to complete loss of facilities.
How Do I Make It Work? First things first. Do you have a DR/BCP? Any hesitation in answering this question indicates you have work to do. Maybe it’s old and just needs updating. Maybe it’s incomplete and disjointed. Maybe you don’t even have one. Stop. Breathe. Relax. When it comes to this, please do not hesitate to put your hand up and ask for help. Planning is the most critical part of all this, followed by testing your plan to make sure it works. Far too many put a plan in place and then never test it. The excuse? We’re not allowed to have the downtime. Really? Then what will you do WHEN something happens and you don’t know if your DR/BCP will work? I’ve encountered a lot of businesses over the past few years that supposedly have a DR/BCP but have never actually tested it.
Assemble a team. Get the right people involved from all areas of the business and not just IT. Everyone is a stakeholder when it comes to the existence of the business. Be sure that each area has at least two representatives if possible; everyone needs to be kept in the loop.
Figure out what threats you face and the odds and probabilities of each occurring, then use that information to determine your best method of recovery and how much you need to invest to achieve that. Budgets are natural limitations, but be sure to balance them against what you are protecting. Don’t spend only 10K to protect 10M, but don’t spend 10M to protect 10K. You get the idea.
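One common way to put numbers on this balance is an expected yearly loss: the cost of an event multiplied by the chance of it happening in a given year. The sketch below is only an illustration of that idea, not a method this post prescribes; the threat names, dollar figures, and probabilities are all made up.

```python
def expected_yearly_loss(impact: float, yearly_probability: float) -> float:
    """Cost of one occurrence multiplied by the chance it occurs in a year."""
    return impact * yearly_probability


def worth_funding(yearly_cost: float, loss_before: float, loss_after: float) -> bool:
    """A mitigation pays for itself when it costs less per year
    than the expected loss it removes."""
    return yearly_cost < (loss_before - loss_after)


# Hypothetical threats: (impact in dollars, probability per year).
threats = {
    "ice storm": (2_000_000, 0.02),
    "server failure": (50_000, 0.5),
    "ransomware": (1_000_000, 0.1),
}

# Rank threats from largest to smallest expected yearly loss.
ranked = sorted(
    threats,
    key=lambda name: expected_yearly_loss(*threats[name]),
    reverse=True,
)

for name in ranked:
    print(f"{name}: {expected_yearly_loss(*threats[name]):,.0f} per year")
```

Notice that with these invented figures the frequent, modest events sit in the same range as the rare, dramatic one, which is why the smaller everyday risks deserve a place in the plan.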
Figure out what you have, how important it is, how long it can be down, and how fast it needs to come back up. Consider data, systems, applications, services, and everything else that makes your business operate and prioritise it. For bonus points, consider your dependencies. It’s hard to authenticate to an application when Active Directory is not yet operational. Connectivity is king.
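Dependencies like these can be turned into a recovery order automatically. The sketch below uses Python's standard `graphlib` module (Python 3.9 or later) to sort a small, hypothetical dependency map so that nothing is restored before the things it relies on. The service names are assumptions for illustration only.

```python
from graphlib import TopologicalSorter

# Hypothetical map: each service lists what must be running before it.
depends_on = {
    "network": set(),
    "active-directory": {"network"},
    "database": {"network"},
    "hr-app": {"active-directory", "database"},
}

# static_order() yields every service after all of its prerequisites.
# It raises graphlib.CycleError if two services depend on each other.
recovery_order = list(TopologicalSorter(depends_on).static_order())

for step, service in enumerate(recovery_order, start=1):
    print(f"{step}. restore {service}")
```

Here the network always comes first and the application last; Active Directory and the database can be restored in either order, or in parallel by different teams.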
Figure out who is doing what and when, and make sure that everyone knows their roles and responsibilities. Be sure that roles are defined as groups and not as individuals and that each group consists of two or more people. Avoid single points of failure at all costs, especially when it comes to people.
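A quick way to check the "two or more people" rule is to list your roles and flag any that fall short. This is a minimal sketch; the role names and the extra team member "Raj" are invented for the example, and the other names simply reuse the people mentioned earlier.

```python
# Hypothetical role roster: role -> people able to perform it.
teams = {
    "build environment": ["Sue", "Raj"],
    "restore servers": ["Bob", "Fred"],
    "test availability": ["Kim"],
}


def single_points_of_failure(teams: dict[str, list[str]], minimum: int = 2) -> list[str]:
    """Roles with fewer distinct people than the required minimum."""
    return [role for role, people in teams.items() if len(set(people)) < minimum]


def uncovered_roles(teams: dict[str, list[str]], unavailable: set[str]) -> list[str]:
    """Roles nobody can perform once the unavailable people are removed."""
    return [role for role, people in teams.items() if not set(people) - unavailable]


print("Needs a second person:", single_points_of_failure(teams))
print("Stalled if Bob is out:", uncovered_roles(teams, {"Bob"}))
print("Stalled if Kim is out:", uncovered_roles(teams, {"Kim"}))
```

With this roster, losing Bob changes nothing because Fred can step in, but losing Kim stops the testing step entirely.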
Figure out what technology will be used and where it will be used, planning for the possibility that the here and now may not always exist. Figure out what works best from multiple data centres to cloud to backups and offsite media rotation. Of course, these need to consider things like data sovereignty and the like, and must respect the classification of information such as top secret or highly protected data.
I could go on and on, but I try to keep these things high-level and short. Hopefully I’ve managed to get you thinking about DR/BCP. Things can get complex fast, so if you like, pick up the phone and give us a call so we can help you.
Pitfalls? The two biggest ones are not having a DR/BCP and not testing it if you do. In all fairness, if your DR/BCP is out of date, it could cause more havoc than the original issue that causes it to be invoked. Regular desktop or paper-based tests are a good walk-through to make sure it all lines up and makes sense, but I would strongly recommend you perform annual live testing of your DR/BCP to make sure it works because you need to count on it when it matters most.
Be sure to update the DR/BCP when there are significant changes to systems, infrastructure, applications, processes, policies, or personnel. Restoring your business should be to its current state; not a state that existed years ago.
Ghosts in the Machine? Things can and do go wrong and sometimes even the backup plan needs a backup. Backup tapes might be corrupted. Sometimes the office is inside an exclusion zone that cannot be accessed. Sometimes a shared DR facility will have many other organisations competing for space and time during a large-scale event. Try to plan for as many eventualities as you can and be ready to deal with them when they occur.
Anything Missing? Always consider the human impact in DR/BCP. People are not robots and may have crises of their own to deal with. An individual’s health and wellbeing come first and foremost, as does that of their family. The technology is secondary to the people that use it and the business that relies on it. Be sure to keep your priorities straight and allow for these kinds of wildcards. Your DR/BCP will work if it’s realistic.
Disclaimer: The thoughts and opinions presented on this blog are my own and not those of any associated third party. The content is provided for general information, educational, and entertainment purposes and does not constitute legal advice or recommendations; it must not be relied upon as such. Appropriate legal advice should be obtained in actual situations. All images, unless otherwise credited, are licensed through ShutterStock