
July 13, 2006

Impact 2006: Thursday Session 1: Implementing a WebCT Vista Disaster Recovery Environment

This session discussed Purdue University's disaster recovery strategy for WebCT Vista. They have developed a system intended to protect them from the complete loss of the building containing their WebCT servers.

This session contained a lot of technical information about the exact hardware used, among other things. They said the presentation will be online at a later date, so I refer people to that presentation for details.

For an idea of the scope, their system runs Vista 3 on Sun/Solaris and Oracle. Their server typically hosts between 3000 and 4000 course sections.

Their discussion emphasized the importance of a good disaster recovery strategy, one that goes beyond just relying on backups. After all, with a site as large as theirs, it could take a week to restore the courses from backup. Faster, more efficient recovery strategies are required.

Some of the main points to ponder:

  • Redundancy. They have redundant load balancers, redundant application servers, etc.
  • They have three environments: DEV, QA, and Production. The production servers are stored in a separate location about a mile away from the other servers. They are set up so they can quickly restore the database on the test/QA systems to have them become an emergency production server if necessary.
  • They pointed out there are two types of failure: systems failure and data failure.
  • For systems failure, there are the typical backup and restore procedures. They examined a variety of database recovery systems. They like RAC, but that's only available in 10g, which isn't available to them until they upgrade to Vista 4 Application Pack 1. They do support using a standby database.
  • For data failure, they examined several options.
    • SAN with daily BCV snapshots. This is good because they can copy the BCV snapshot to the test server to make it production. However, the snapshot is not directly recoverable.
    • SRDF copy. They didn't use that because it has no internal mirroring and the SAN space is not protected.
    • They opted for Oracle Data Guard using a physical standby in maximum performance mode with no optional delay. They don't use max protection mode because if the standby crashes, the production server will lock waiting for it to return. The no optional delay setting means they cannot easily recover if an instructor does something silly and wipes out a course.
  • They are considering some changes including:
    • Using "Max Availability" mode as a compromise between max protection and max performance.
    • Changing the "no optional delay" setting so that they can recover from user errors.
    • Developing a way to test the disaster recovery procedures. Currently this is tough to do without disrupting operations.
    • Possibly moving disaster recovery off campus to another city. This implies they may need better network connections.
    • Expanding the use of the standby database to use it in read-only mode for data mining and section archiving.
  • The recovery time for disaster recovery is 24 to 48 hours, so they don't use it for power outages, etc. They just tell people the system will be down during that time.
  • They currently have 200 GB allocated to archive logs. The database is 800 GB. By May, they expect this to be 1.5 terabytes. It takes them about one week to do full backups.
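
To put the sizes in the last bullet in perspective, here is a rough back-of-the-envelope calculation. It is only a sketch: it assumes decimal units (1 GB = 1000 MB), that "one week" means seven full days of continuous copying, and that backup time scales linearly with database size. None of those assumptions were stated in the session, and the function names are mine.

```python
# Back-of-the-envelope numbers based on the figures in the notes above.
# Assumptions (mine, not the presenters'): decimal units, a 7-day week of
# continuous copying, and backup time that scales linearly with size.

SECONDS_PER_DAY = 24 * 60 * 60


def effective_throughput_mb_s(size_gb: float, days: float) -> float:
    """Average MB/s implied by moving size_gb in the given number of days."""
    return size_gb * 1000 / (days * SECONDS_PER_DAY)


def projected_days(current_gb: float, current_days: float, future_gb: float) -> float:
    """Days a full backup would take at future_gb with unchanged throughput."""
    return current_days * future_gb / current_gb


if __name__ == "__main__":
    rate = effective_throughput_mb_s(800, 7)
    print(f"Implied throughput today: {rate:.2f} MB/s")
    print(f"Full backup at 1.5 TB: {projected_days(800, 7, 1500):.1f} days")
```

Under these assumptions a one-week full backup of 800 GB works out to a little over 1 MB/s on average, and the same process at 1.5 terabytes would take roughly two weeks, which is consistent with their point that backups alone are not a recovery strategy.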

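The trade-off behind the "no optional delay" setting mentioned in the list above can be illustrated with a small sketch. It assumes a simplified model in which the standby applies each change a fixed delay after it happened on production; the function name and the times in the example are hypothetical, and this is not how they described their own procedures.

```python
from datetime import datetime, timedelta

# Simplified model (an assumption for illustration): a change made on
# production at time T reaches the standby at T + apply_delay.


def standby_still_clean(error_at: datetime, noticed_at: datetime,
                        apply_delay: timedelta) -> bool:
    """Return True if the standby has not yet applied the mistaken change."""
    return noticed_at < error_at + apply_delay


if __name__ == "__main__":
    wiped = datetime(2006, 7, 13, 10, 0)     # instructor wipes a course
    noticed = datetime(2006, 7, 13, 10, 45)  # someone notices 45 minutes later

    # No delay: the standby already mirrors the mistake.
    print(standby_still_clean(wiped, noticed, timedelta(0)))
    # Four-hour delay: the standby still holds the course.
    print(standby_still_clean(wiped, noticed, timedelta(hours=4)))
```

With no delay the mistake is on the standby as soon as it is made, so the standby cannot help. With a delay there is a window to catch the error, at the cost of a standby that lags production by that amount.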
Posted by kvl014 at July 13, 2006 11:36 AM