Build your professional network on facebook via our app Go to app
 
<< Prev  5 of 9 in Topic  Next >>
 
Industry : IT Services Functional Area : Application Software
Activity:  0 comments  181 views  last activity : 07 06 2010 20:18:04 +0000
 Refer 11
Share
 
 
 
Potential database system failures and Recovery PlansDeveloping a disaster recovery plan is essential for any SQL Server database environment, but writing an adequate plan is hard work. I've made starting the task a bit easier for myself by stepping through a relatively complete list of the potential database system failures. In this month's column, we'll walk through such a checklist of potential failures—from the most catastrophic failures to the more benign. The idea is to use this list as a starting point for creating your own disaster recovery plan. Although this list might cover all of the possible failures for a given site, and it might include conditions that don't apply to your site, I've found that it fits a great majority of situations.Database unavailability
The first set of failures would cause a SQL Server database to become unavailable to an application. I present them in order of the most catastrophic first.
1. Loss of data center
One reason a database could become unavailable is through the loss of your data center. Whether a major offsite installation or a closet in a small business, a natural or man-made disaster could take the entire system out.
In this scenario, you'd have to re-create your databases on different servers in a new location, from backup data that's been stored off-site. But how old would the data be? If you take nightly backup tapes to another location, your data could be up to 24 hours old. Sometimes having data that old is unaccepTable, though, and sending backup media to off-site storage more than once a day is too expensive. One solution is to get the data to another location—via log shipping or replication, for example. It goes without saying that your alternate location shouldn't be on the same power grid or earthquake fault, in the same floodplain, and so on as your primary site.2. Loss of network switch
Chances are the database server is on a single network switch. If that switch goes down, it will be impossible to access the database. Some data centers provide redundant network switches to deal with this scenario.
3. Loss of an entire database server and its disk system
Databases might become unavailable because the entire database server, meaning both its CPU and disk subsystem, is lost. In this scenario, the database with its configuration, data, and relationship to other database servers would all have to be re-created. Typical events that might cause this include power loss, fire, or other physical damage to the server.In this case, if transaction logs or replicated data exists on other servers, a new server could be reconstructed from data on site. Redundant power supplies and UPS systems are meant to protect against sudden power loss.4. Loss of CPU
The database becomes unavailable because the CPU of the system (motherboard, RAM) is lost even though the disk subsystem remains intact. Because the disk subsystem is intact, the SQL Server data files are usable.Here's where the MSCS cluster failover really shines. Loss of a single server simply triggers a failover to another node on the cluster, and all data on disk is preserved. Other options include retrieving the data files somehow and 1) applying them to another server; or 2) promoting a secondary log shipping server or replication subscriber to the role of primary server. I've also heard of having a second server identically configured such that when the drives from the failed server are inserted into the secondary server, it just starts up—immediately recognizing the data files on those drives.5. Loss of local disk subsystem
The database might become unavailable because of a failure of the local disk subsystem. The database would have to be restored onto a fresh set of disks or another server. Typical events include disk drive failures, multiple disk failures on RAID, and controller failures. Generally, RAID 5 and 0+1 systems can handle up to two disk failures, but on RAID 0+1, if two disks that are mirrored fail, the entire volume must be rebuilt.
In this case, you're only protected for as much data as has been backed up or copied off the server's disk drives—either through log shipping, replication, or file copies. Again, if you're on an MSCS cluster, a local disk system failure will trigger a failover to another node, and only uncommitted data will be lost.6. Loss of MSCS cluster external disk subsystem
If you lose an MSCS failover cluster's external disk subsystem, you'll have to reconstruct the database from backups stored somewhere other than on the server or promote a secondary log shipping or replication subscriber with current data.
7. Loss of SAN disk subsystem
The database might become unavailable because it loses external Storage Area Network (SAN) storage. Typical events include multiple disk drive failures, SAN component failures, or perhaps power loss to the SAN.
Given how complex SAN software can be, this is always a possibility. The database will have to be restored from backups, or from copies of data files on some external location. Again, a secondary server could reduce recovery time considerably here.8. Loss of network connectivity to the server
The database becomes unavailable because network connectivity to the server is lost. Typically, this would be due to a malfunctioning network card. Database servers often contain more than one network switch to deal with this possibility.
9. Loss of server OS
If the server hardware is functional but the operating system needs to be repaired or reinstalled, SQL Server will have to be reinstalled as well, and the database will clearly be unavailable during that period. On a failover cluster, failover to the secondary node must take place to keep the database available.
10. Loss of SQL Server
The SQL Server might fail, making the database unavailable. In this case, the server hardware and operating system are functional, but SQL Server must be reinstalled. Typically, this would be due to a bad SQL Server installation, odd access violations, or memory leaks. Bear in mind that service pack upgrades also make SQL Server unavailable for the duration of the upgrade.
11. Loss of logins and permissionsOccasionally, SQL Server will be intact, but the expected logins and their permissions fail. This could occur due to administrative error of some kind, and a script to repair the logins and permissions that could be applied quickly would be a good solution.12. Loss of database
The SQL Server might be intact, but the database might become suspect or otherwise unavailable. Typically, this would be due to some kind of data corruption, but a malicious user or developer might have actually dropped the database. This is as serious as losing the server, and the database would have to be re-created from backup media or replaced by a warm standby server.
Loss of database usability
In the remaining set of failures, the database is available, but for some reason unusable.13. Loss of database Tables
In this scenario, the SQL Server database can be accessed, but the Tables aren't there or are unavailable for some reason. Again, the database might have to be re-created from backup media, or replaced by a warm standby server.
14. Loss of data in tables
Suppose the Tables are present and can be queried, but data has been lost. Typically, this would be caused by accidental deletes or truncations of data, or bugs in the client software that mangle the data.
In this case, you might have to restore the database to a certain point in time, before the data was lost. The best way to do this is to restore transaction log backups in order to find when the error occurred, and then restore to just that point in time. Note that neither a failover cluster nor a replication subscriber will help here, but a log shipping standby with a sufficient delay before applying transaction logs could help in detecting when the error occurred.15. Loss of data feed
The Tables have data, but it's no longer current because a data feed of some kind has failed. This is still a failure, even though it might be outside of SQL Server as such. What's challenging here is to set up a mechanism for detecting this kind of failure.
16. Loss of replication
In this scenario, data is in Tables but replication is no longer updating a subscriber. If your server is the subscriber, it might be that the distribution server has failed. If your server is the publisher, perhaps replication has failed for some reason, and the distribution server is malfunctioning. In either case, you need to plan for methods of detecting the outage (say, from alerts) and procedures for restoring replication.
17. Loss of performance
The database might be available and have good data, but poor system performance might make the database unusable. For example, indexes are lost or corrupt, statistics are out of data, or a lot of blocking and deadlocking occur. Even though no data is lost, the problems must be addressed immediately.
18. Loss of data quality
DBAs and testers don't often consider this issue. What if the database can be queried successfully but the data is of poor quality due to inconsistency, age, or incompleteness? These problems could result from client software bugs, poor database design, or various user errors. In any case, a method for sanity-checking the data can be helpful in detecting these kinds of errors.
 
 
0 comments on "Potential database system failures and Recovery Plans"
Add your comment on "Potential database system failures and Recovery Plans"

Rate:
Submit
Leading Recruitment firm
ETL.Ab Initio jobs in Pune
Cognos jobs in Pune
Cognos jobs in Kolkata
Viewers also viewed
Dynamic Management Views   The dynamic management views (DMV’s) in SQL Server 2005 are designed...
 
11 referals 6 comments, 1828 views
Business schools use the medium of case studies as a learning instrument. But, the case studies...
 
120 referals 5 comments, 45 views
Though media vs Self
 
157 referals 105 arguments, 3005 views
more...  
Recent Knowledge (49)
Isn't it amazing, that Indian Government has miserably failed to handle a major issue like this,...
 
0 referals 2 comments, 153 views
We know that Ecommerce is the most happening trend of the day. After the emergence of ecommrce,...
109 referals 1 comments, 53 views
Yes they are going to be here.. and the first destination is Bangalore... Google cars and trikes...
 
1339 referals 12 comments, 443 views
more...  
More From Author
Anyone statement which cause some uncomfort to common man or mass of the people, nothing wrong in saying Sorry. She is good atleast said sorry. Raj Thakre, Rahul Gandhi, Digvijay Singh and many other politicians never say sorry for their mistakes.
Thanks for sharing the article. Good one
@Munshi: Please don't blame RSS for everything. The sounth India treats all north Indians as aliens. Tamilnadu people know Hindi language but feel shame to speak. They treat any others as outsiders... This all haterated are growing for the benifits of...
more...