This article describes a process for planning and implementing disaster recovery. In our experience, the secret to success is to involve the local response team from the outset of the project.
The disaster recovery plan is designed to help companies respond quickly and effectively to a disaster in a local office and restore business as quickly as possible. In our experience, participation in the planning and implementation process is more important than the process itself: it helps ensure that the local response teams understand what they need to do and that the resources they need will be available.
- DRP: disaster recovery plan
- BIT: business impact timeline
- ERT: emergency response team
- BIA: business impact assessment
- Countermeasures: physical or procedural measures we take in order to mitigate a threat
- PRT: primary response time; how long it takes (or should take) to respond (not resolve)
- RRP: recovery and restore plan; recovery from the disaster and restore to original state
DR planning is not about writing a procedure, getting people to sign up and then filing it away somewhere. In the BIT (business impact timeline) we see a continuum of actions before and after an incident. In the pre-incident phase, the teams are built, plans are written, and preparedness is maintained with training and audit. After an incident, the team responds, recovers, restores service and assesses effectiveness of the plan.
T=ZERO is the time an incident happens. Even though one hopes that disaster will never strike, refresher training should be conducted every 6 months because of employee turnover and system changes, and the ERT should conduct self-audits every 3 months.
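The training and audit cadence can be sketched as a small scheduling helper. This is only an illustration: the plan start date and the function names `add_months` and `schedule` are hypothetical, and only the 6-month and 3-month intervals come from the text.

```python
import calendar
from datetime import date


def add_months(start: date, months: int) -> date:
    """Return the date `months` months after `start`.

    The day is clamped to the last day of the target month,
    so Jan 31 plus 3 months becomes Apr 30.
    """
    month_index = start.month - 1 + months
    year = start.year + month_index // 12
    month = month_index % 12 + 1
    last_day = calendar.monthrange(year, month)[1]
    return date(year, month, min(start.day, last_day))


def schedule(start: date, interval_months: int, count: int) -> list[date]:
    """Return `count` due dates spaced `interval_months` apart after `start`."""
    return [add_months(start, interval_months * i) for i in range(1, count + 1)]


# Hypothetical date on which the DR plan went live (not from the article).
plan_start = date(2024, 1, 31)

training = schedule(plan_start, interval_months=6, count=2)  # refresher training
audits = schedule(plan_start, interval_months=3, count=4)    # ERT self-audits

for due in training:
    print("Refresher training due:", due.isoformat())
for due in audits:
    print("Self-audit due:", due.isoformat())
```

Computing every due date from the plan start, rather than from the previous due date, keeps the schedule from drifting when a month-end date has been clamped.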
Building the DR plan
Build the ERT
Assign a 2-person team in each major office to be the ERT (in small offices with one or two people, a single employee will fill the role alone). The people in the ERT need to have both technical and social skills to handle the job. Technical skills mean being able to call an IT vendor and help the vendor diagnose a major issue such as an unrecoverable hard disk crash on an office file and print server. Social skills mean staying cool under pressure and following procedure in major events such as fire, flooding or terror attack.
In addition to an ERT in each office, one ERT will be designated as “response manager”. The response manager is a more senior person (with a backup person) who will command the local teams during a crisis, maintain the DRP documentation and provide escalation.
The local response team becomes involved and committed to the DRP by planning their responses to incidents and documenting locations of resources they need in order to respond and restore service.
DR Planning Pre-incident activities
The purpose of the initial call is to introduce the DRP process and set expectations for the local ERT. Two days before the call, the local team will receive a PowerPoint presentation describing DRP, the implementation process and the BIA worksheet. At the end of the call, the team will commit to filling out the worksheet and preparing for a review session on the phone one week later.
Business Impact Assessment (BIA)
In the BIA, the team lists possible incidents that might happen and assesses the impact of a disaster on the business. For example, there are no monsoons in Las Vegas, but there might be an earthquake (Vegas is surrounded by tectonic faults and ranks number 3 in the US for seismic activity), and an earthquake could put a customer service center in Vegas out of business for several days at least.
Recover and Restore
Recovery is about the ERT having detailed and accessible information about backups – data, server, people and alternative office space. Within 30 days after a disaster, full service should be restored by the ERT working with local vendors and the response manager.
It may also be useful to use http://www.connected.com for backup of data on the distributed PCs and notebooks.
DR Plan Review
The purpose of the call is to allow each team to present their worksheet and discuss appropriate responses with the global response manager. Two days before the call, the teams will send in their BIA worksheet. The day after the call the revised DRP will be posted.
Filling out the DRP worksheets
There are two worksheets: the BIA worksheet (which turns into the primary response checklist) and the RRP (recover and restore plan) worksheet, which contains a detailed list of how to recover backup resources and restore service.
Filling out the BIA worksheet
In the BIA worksheet, the team lists possible incidents and assesses the impact of a disaster on the business. To assess that impact, we grade incidents using a tic-tac-toe matrix.
The team will mark the probability and impact rating for an incident going across a row of the matrix. The two ratings are added to give a risk score: a risk with probability 2 and impact 5 scores 7, while another risk with probability 1 and impact 3 scores 4. Countermeasures would be implemented for the 7 risk before being implemented for the 4 risk.
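The grading arithmetic can be sketched in a few lines of Python. This is a minimal illustration, not part of the worksheet itself: the `Incident` class, the `prioritize` function and the incident names are hypothetical, and only the probability and impact values and the rule of adding them come from the text.

```python
from dataclasses import dataclass


@dataclass
class Incident:
    name: str
    probability: int  # probability rating marked on the matrix
    impact: int       # business impact rating marked on the matrix

    @property
    def risk(self) -> int:
        """Risk score: probability plus impact, as in the worked example."""
        return self.probability + self.impact


def prioritize(incidents: list[Incident]) -> list[Incident]:
    """Order incidents so the highest risk gets countermeasures first."""
    return sorted(incidents, key=lambda incident: incident.risk, reverse=True)


# Incident names are illustrative; the ratings match the two risks in the text.
incidents = [
    Incident("file server disk crash", probability=1, impact=3),
    Incident("earthquake", probability=2, impact=5),
]

for incident in prioritize(incidents):
    print(f"{incident.name}: risk {incident.risk}")
```

Because `sorted` is stable, incidents with equal scores keep the order in which the team listed them.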
BIA worksheet step by step
- Add, delete and modify incidents to fit your business
- Grade business impact using the “tic-tac-toe” matrix for each incident.
- Set a primary response time (how quickly the ERT should respond, not resolve)
- Establish an escalation path: escalate to local service providers and the response manager within a time that matches the business impact. Escalate to the local vendor immediately and escalate to the response manager according to the following guidelines:
  - Risk > 6: within 15 minutes
  - Risk <= 6 and >= 4: within 60 minutes
  - Risk < 4: within 2 hours
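The escalation guidelines above can be expressed as a simple lookup. This is a sketch under one stated assumption: the thresholds "within 15" and "within 60" are read as minutes. The function name `escalation_deadline` is hypothetical.

```python
from datetime import timedelta


def escalation_deadline(risk: int) -> timedelta:
    """Return how soon the response manager should be notified.

    Assumes the 15 and 60 thresholds are minutes; the local vendor
    is contacted immediately regardless of the risk score.
    """
    if risk > 6:
        return timedelta(minutes=15)
    if risk >= 4:  # covers risk scores 4, 5 and 6
        return timedelta(minutes=60)
    return timedelta(hours=2)


for score in (7, 5, 3):
    print(f"Risk {score}: escalate within {escalation_deadline(score)}")
```

Returning a `timedelta` rather than a bare number keeps the unit explicit, which matters because the guidelines mix minutes and hours.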
Filling out the RRP worksheet
In the RRP worksheet, the team documents in detail how to locate and restore backups and how to access servers (in the network and physically).
Maintaining the DR plan
Once every 6 months, the response manager will run an unannounced exercise, simulating an emergency. In a typical DR exercise the local ERT will be required to:
- Respond to a single emergency (for example, an earthquake)
- Verify contents of RRP checklist
- Physically locate backups
After completion of the DR plan, the local response team needs to perform periodic self-audits. A member of the local ERT will schedule an audit once every 3 months and notify the response manager by email regarding the date.
- The audit should take about 1 hour and will check documentation and backup readiness
- Documentation readiness
  - Make sure telephone numbers of critical suppliers are posted at the entrance to the office. Make sure the numbers are current by calling them.
  - Read the primary response sheet
  - Check the wallet-sized cards with emergency phone numbers and procedures, to be carried by all employees
  - Check the onboard list: who is in the office today and who is traveling or on vacation
- Backup readiness
  - Local backup files/tapes