Backups are a necessary evil in System Administration, and although most of us dislike the process, it is by far the most critical element under the IT umbrella. I like to think of the whole process as a Recovery Plan instead of a backup plan, because in the end all I care about is that the data is recovered properly and quickly. One of the biggest pitfalls new System Administrators or System Administrators new to a particular company do, is that they do not test their own backups. Not being able to recover information from a system you designed or recommended is the quickest and surest way to get fired.
1. Risk Assessment
Although this sounds simple, a true risk assessment is rarely done and far out of the reach of the average business. Although hiring actuaries and combing through insurance statistics is ideal, this is far from what companies are willing to do for a data recovery plan. Many System Administrators find companies that have not experienced data loss are less willing to be thorough in their analysis and budgeting.
One of the first things in a recovery plan is to write down the possible external risks, some examples:
- Physical Break-in / Theft
- Virtual Break-in (Cracker)
Ask your Insurance Company what some likely issues will be, they will be happy to tell you every possible disaster scenario.
Next, think of internal risks such as:
- Data Corruption
- Hardware Failure (Electrical or Mechanical)
- User Error (accidental deletion)
- User Malice (non-accidental deletion)
To supplement risks, look through your company’s history for any previous data loss and the reasons for it. Enlist the help of your industry colleagues for any scenarios not on the list. You’ll be amazed at some of the “once-in-a-lifetime” stories you’ll hear – some may be applicable to you.
2. Impact Rating
An impact analysis on each scenario listed above should be created. This involves the hardware, software and data (all systems) that are affected and how. Does the impact involve a full or partial outage? If hardware is likely damaged, how quickly can it be replaced? Etc. Suppose there is a fire in a server room and the servers are damaged. You have the LTO tapes as backups, but no server, and no LTO drive to restore it with. How many days will it take to get an LTO drive and from where? During this phase of planning a vendor and consultant list should be compiled.
3. Risk Rating
This can be included in the budgetary section, or done beforehand. Combine the risk assessment items and impact ratings and sort them. This is important. You should implement a recovery plan that encompasses as many items as possible. By sorting, it makes the budgetary step easier when you need to cut coverage because of costs.
There are many methods to ensure data continuity in certain scenarios, but be careful as there is rarely a one-size-fits-all approach to backups.
Mirroring: Is designed to mitigate single-point hardware failures. If your database server fails, having a mirrored server will ensure your data is available. Mirroring at another location may also solve router, switch and connection issues for external clients. Mirroring does not protect against corruption, viruses, user error or malice. On-site mirroring does not protect against theft, fire, flood, etc.
Removable Backup Media: Backups to media protect against issues such as corruption, viruses, user error or malice. If you leave these items on-site they will not protect against theft, fire and flood. If they are taken off-site, it will take longer to retrieve your data in case of loss. With backup tapes, hard drives and cds, the backup data itself is typically a day or older. If this method of backing up is your only method, be sure your business can survive with older data. Be sure to have multiple days of backups, or a weekly backup with incremental backups per day. Often times users will not report data loss until days after the event, by which time relevant backups have already been overwritten with newer, useless backups.
Non-removable Backup Media: Items such as NAS (Network Attached Storage), DAS (Direct Attached Storage) or SAN (Storage Area Network) can be used to backup servers, virtual machines and data. The issue with these is that they are not removable. This will not protect against theft, fire, flood, etc.
Be careful of proprietary systems used to backup your data. Be sure to audit your recovery scenario regularly to ensure your backups can be recovered. Companies go out of business, and items are discontinued. Do you have any backups on jazz drives? How difficult would it be to recover if you had to find a new jazz drive? Don’t know what a jazz drive is? Exactly!
5. Budgetary Concerns
With your sorted list in hand, you can now plan for the items you need to mitigate any disasters. Protecting against many scenarios may prove to be prohibitive in cost. If you do not make the budgetary decisions, be sure that your list is as comprehensive as possible. It is up to IT to determine the impact of all scenarios, and it is up to the budgetary members to determine how much they want to spend. If they say no, you have at least outlined all the possibilities.
In your cost analysis, include replacement or redundancy items such as:
- Backup Storage (Tapes & Tape Drives, Hard Drives, CDs)
- Backup Servers, NAS, SAN, DAS
- Mirrored Servers
- Redundant Connections (Internet and Cabling)
- Backup Routers, Switches, etc
Part of your impact analysis should include what is damaged or lost. If you have the tapes, but no tape drive, you will need to replace it in order to retrieve the data. Make sure you have the ability to read your backups when you need. If it takes 3 days to ship a new tape drive, but the cost is minimal, consider having a backup tape drive in stock.
While they are numbers out there for determining how much to spend on a backup and recovery system, you should make your decisions based on the impact and risk. If your data is your business’ main asset, you should spend a larger chunk of your budget to protect it. If time is critical in retrieving your data, the solution may include keeping extra hard drives, servers, router and switches in stock. If time is not an issue and an outage can be handled for days, you can order items at the time of recovery.
Determining budget can be a mix of preventative costs and the cost of downtime to the business (lost sales, lost productivity). Ideally disaster scenarios should have a cost to the business attached to them. If a server failure results in $0 productivity for the day, the overall impact can be many thousands of dollars per day – that fact alone may convince management to have a redundant server available.
6. Deployment and Testing
Don’t forget this step! Backups are useless unless they can be recovered. Take a weekend to simulate a likely recovery scenario. You may be surprised at all the “gotchas” when recovering data. Common stumbling blocks include not backing up database logs that are critical to recovery (ex. Exchange Server), or recovering to dissimilar hardware (Ex. RAID5 on a different controller).