Business continuity planning (BCP) and disaster recovery (DR) are all about preparing for and responding to major adverse events.
These events are rare, so unlike most other areas, you get little opportunity to test and validate your BCP and DR capability through live practice.
So if they’re rare, why bother?
Why shift focus away from things that do happen regularly?
Your customers and other stakeholders understand and accept that sometimes things go wrong. But they have high expectations of your ability to respond, and that ability is front and center when something does go wrong. In these types of events, the stakes are higher. Failing to respond effectively to a major event can lead to contract termination or a long-lasting negative impact on your reputation. On the positive side, an effective response in a disaster scenario is the best way to build long-term trust and positive customer sentiment.
What types of events are we talking about?
The definition of what types and severity of events trigger your BCP and DR Plans should be based on your own assessment of your company and environment.
The starting point is to consider the surrounding processes like service desk, incident management and sometimes change management. These processes each define how to manage related 'events'. At some threshold, depending on how they’re designed, they will no longer define sufficient methods to deal with the most critical events. For example, if your CTO leaves disgruntled and takes most of the development team with them, you're unlikely to manage that with the service desk. Or if you end up in the news for a privacy breach, that calls for more than a run-of-the-mill incident response.
As a rule of thumb for when to enact the BCP/DR rather than a routine service desk or incident response, you might consider:
- Is the event serious enough to notify Executive Management or even the Board?
- Is it a once-in-3-years event? Or, for more mature and stable businesses, perhaps a once-in-10-years event? That is, is it sufficiently rare that it would be a drop-everything-and-respond situation?
- Will this event require additional management on top of, or instead of, the standard processes?
- Could this event put the business on hold, have a major adverse impact on customers, or have catastrophic consequences if managed poorly?
If the answer to any of those is yes, that type or severity of event is likely to require enacting your BCP/DR plans. It’s a good idea to define these types of events within the BCP, DRP, and/or incident management policies and procedures so that everyone is clear on the difference and when each type of response is appropriate.
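The rule of thumb above can be sketched as a simple triage check. This is a minimal illustration only: the class, field, and function names are hypothetical, and the answers to each question still depend on your own judgment and assessment.

```python
from dataclasses import dataclass


@dataclass
class EventAssessment:
    """Answers to the four rule-of-thumb questions (field names are illustrative)."""
    notify_executives_or_board: bool
    rare_drop_everything_event: bool
    needs_extra_management: bool
    could_halt_business_or_harm_customers: bool


def should_enact_bcp_dr(assessment: EventAssessment) -> bool:
    """Return True if any rule-of-thumb question is answered 'yes'."""
    return any(
        (
            assessment.notify_executives_or_board,
            assessment.rare_drop_everything_event,
            assessment.needs_extra_management,
            assessment.could_halt_business_or_harm_customers,
        )
    )


# Example: a routine service desk ticket versus a production data corruption.
routine_ticket = EventAssessment(False, False, False, False)
data_corruption = EventAssessment(True, False, True, True)

print(should_enact_bcp_dr(routine_ticket))   # False
print(should_enact_bcp_dr(data_corruption))  # True
```

Writing the questions down in one place, whether in code or in the policy itself, helps everyone apply the same threshold.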
The types of events to consider, usually in combination with a level of severity, are:
- System outages
- Production data corruption
- Data security breaches
- System security breaches
- Public relations matters
- Attempted or successful external attacks
- Loss of key office locations
- Loss of key personnel
- Any failures that halt critical business functions that your customers rely on
- Third-party failure or breach
How do these events fit into each of the 'plans'?
There is a lot of overlap between the plans for incident response, business continuity, and disaster recovery. They may all be combined into one document and defined process, or kept separate. Generally, the difference is as follows. Incident management covers all adverse, system-related events regardless of severity and type. Disaster recovery focuses on major IT disruptions and the technical side of recovering systems, data, and production services in a fast, secure and effective manner. Business continuity covers the broader handling of major adverse events, including the non-technical side of the response and the surrounding non-technical processes.
The Business Continuity Plan is commonly believed to be all about the physical offices. But it should also consider the likes of security breaches, loss of key personnel, downtime in any key functional areas (people, processes or systems related), and third-party-related issues. It should consider anything that may prevent the continuity of important business functions, your services, or even the survival of your business.
What's documented in each of these plans?
Incident Management & Response
The Incident Management Policy and/or the Incident Response Plan/Policy should cover end-to-end handling of unplanned and adverse events. This includes how they are identified, assessed, classified, and responded to, how they may be 'closed' (the criteria or requirements), and any post-incident review activities for 'lessons learned' to prevent a recurrence. There should also be a clear linkage to the Change Management Policy or process for how incidents feed into product fixes and the relative priority of those compared to other product change plans. Incident Management is explored further in Best Practices: Incident Management.
Disaster Recovery Plan (DRP)
The Disaster Recovery Plan is directly linked to both the incident management process and the Business Continuity Plan. Its focus is how to recover the critical system functions when a major event disrupts them.
In contrast to the BCP, which has a broader operational focus, the DRP is focused on the technical side of recovering data and systems back to normal operation. With infrastructure as a service and integrated DR functions, the DRP is often a very simple process and document. It may simply set out the steps to recover data and systems from backup, as well as a periodic (quarterly or annual) review process to verify that the recovery practices are successful. It may also be supported by multiple availability zones for automatic failover in a disaster scenario where a data center is lost. The DRP, like any policy document, should set out roles and responsibilities, as well as any key external or internal contacts needed to enact the plan effectively.
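As a minimal sketch of the periodic review described above, the check below flags when the last successful restore test is older than the chosen review interval. The 90-day interval and the function name are assumptions for illustration, not a prescribed standard.

```python
from datetime import date, timedelta

# Hypothetical review interval; choose quarterly or annual to suit your DRP.
REVIEW_INTERVAL = timedelta(days=90)


def restore_test_overdue(
    last_successful_test: date,
    today: date,
    interval: timedelta = REVIEW_INTERVAL,
) -> bool:
    """Return True if the last successful restore test is older than the interval."""
    return today - last_successful_test > interval


print(restore_test_overdue(date(2024, 1, 10), today=date(2024, 3, 1)))  # False (51 days)
print(restore_test_overdue(date(2024, 1, 10), today=date(2024, 6, 1)))  # True (143 days)
```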
Business Continuity Plan (BCP)
The BCP is often the most comprehensive of these three areas. It needs to broadly identify and address any types of events that may disrupt the continuity of your people, processes, systems, or services. For those events, it needs to clearly identify the key dependencies, specific objectives and priorities, and the practical components of how to respond effectively. Then, like all policies, procedures and plans, it should set out roles and responsibilities and the overall governance of how the BCP is reviewed, updated, and verified periodically.
Business Impact Analysis (BIA)
The BCP may start with a Business Impact Analysis: what are the critical functions, and what happens if they are impacted? This is a good starting point for understanding which types of events may disrupt the continuity of your business, because it shows which events would impact these critical functions.
Recovery Time Objectives (RTOs)
Following on directly from the BIA, how quickly do these critical functions need to be recovered before there is a significant adverse impact? That may be, for example, the point at which your customers are materially impacted and unable to continue their own operations, or the impact is serious enough to cause reputational damage, or financial damage if there are covenants in your contract.
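To illustrate how RTOs can be used during an event, the sketch below compares current downtime against the objective for each critical function. The function names and RTO values are hypothetical; real values should come from your own BIA.

```python
from datetime import timedelta

# Hypothetical critical functions and RTOs; real values come from your own BIA.
RTO_BY_FUNCTION = {
    "customer_login": timedelta(hours=1),
    "payment_processing": timedelta(hours=4),
    "internal_reporting": timedelta(hours=48),
}


def functions_past_rto(downtime_by_function: dict[str, timedelta]) -> list[str]:
    """Return the functions whose downtime exceeds their RTO, sorted by name.

    Raises KeyError if a function has no RTO defined.
    """
    return sorted(
        name
        for name, downtime in downtime_by_function.items()
        if downtime > RTO_BY_FUNCTION[name]
    )


current_downtime = {
    "customer_login": timedelta(hours=2),
    "payment_processing": timedelta(hours=2),
    "internal_reporting": timedelta(hours=6),
}
print(functions_past_rto(current_downtime))  # ['customer_login']
```

A comparison like this helps prioritize recovery effort toward the functions that are closest to, or already past, their objective.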
Scenarios & Responses
The scenarios usually come from a brainstorming exercise to list the possible events that may cause a continuity issue or require enactment of the BCP in some form. They should consider the business impact to identify the event types, but also form high-level response plans that fit with the recovery objectives. For the purpose of the BCP, you may find it worthwhile to group scenarios where the responses are likely to be similar for similar types of events. The response plans should be high-level enough that they can be quickly and easily referenced and followed, but also sufficiently clear, or linked to further detail, that they can be carried out effectively. It's often appropriate to point to "who" will respond as opposed to "what" will be done, as most major events require discretion at the time. But you want to ensure it's the right person, with the authority, expertise, and resources to manage it.
Incident Response Team
The Incident Response Team is a predefined team of participants responsible for coordinating and executing the BCP. This team should be briefed in advance on the essentials of their role and feel prepared to enact the BCP. The BCP itself should include contact details for this team so that other members of the business know whom to contact when the BCP may need to be triggered, or is already in effect.
The response playbooks or steps should include the high-level pre-planned steps that may be necessary if the types of BCP events occur. This may be a flow chart, a sequence of considerations, or a step-by-step guide. It's impossible to completely plan out every step that may be needed in an unforeseen event, which is by nature when the BCP is enacted. The purpose is to prompt considerations that may otherwise be missed, forgotten, or poorly executed in the heat of the moment. Having this reference point helps reduce the likelihood of that poor execution.
Various other things can be included in the Business Continuity and Disaster Recovery Plans. Whatever they contain, the plans should be tested at least annually to check that they are appropriate and effective. Often that's done via a desk-based run-through or simple simulations, as it's not always feasible to do live tests or more real-world simulations. The purpose of doing some form of testing is to validate the assumptions made in the BCP and DR Plans and identify areas for improvement. The finding may be as simple as discovering that the plan has a communications plan but the list of contacts to communicate with has not been prepared yet.
AssuranceLab's Best Practices Series
AssuranceLab's best practices series is about highlighting the "real operational benefits" that come from effective control practices. At best, they support your company culture, provide structure and clarity, and enable scalable growth. At worst, they tick the box of what your customers expect, reduce the reactive "firefighting" and time-wasting, and help you demonstrate your compliance with leading standards like SOC 2 and ISO 27001.