July 12, 2006
Impact 2006: Wednesday Session 4: Tips for Managing WebCT 6
I thought this session was going to talk about operational tips for service managers and user support staff. However, it was aimed more at system administrators and DBAs who were planning for CE 6 implementation. Denise took courses like this last year in San Francisco.
Most of the talk was boring techie talk. However, there were a few things I took away from this to discuss with Denise.
- They really recommend having a test system running on the same hardware configuration as production. One reason is to have a spare hardware box that can easily be swapped for the production hardware to reduce downtime. We plan to have a test server available, but I have to discuss with Denise exactly how that system will operate and on what hardware.
- They say one application server will handle about 500 concurrent users. WebCT Support Services can help analyze our 4.1 system to see how many concurrent users we have.
- The WebCT CE 6 application server uses about 1 GB of RAM plus 1.5 GB of disk swap space. This does not change as we get larger boxes, so a 4 GB machine and a 20 GB machine will operate the same. The best way to add capacity is to cluster more servers, not to get a bigger server. This also improves redundancy.
- A typical minimum recommended system for production is one application server, and one database server.
- A minimum recommended system for a clustered environment requires five servers: two application servers, an administrative server to manage the cluster, a load balancer (installed on a server or dedicated hardware), and a database server. This should also be repeated for test and, if necessary, development servers. They recommend that all these servers run on the same hardware platform. That way, if one box crashes, you can take one of the test boxes, or one of the application server boxes, and use it to replace the broken box if you have to.
- The administrative box is required to properly run the cluster. If the box fails, the cluster can run up to 24 hours before the box must be replaced. However, when it is replaced, you must reboot the entire cluster. Again, they emphasize keeping a spare box to swap in to minimize downtime. In an emergency, we could bring the administrative server up as a separate process on one of the nodes (with its own separate IP address) until we can get a new admin box. However, that would require a reboot of the cluster. It is also not recommended, because if the node hosting the admin process crashes, the entire cluster crashes.
- If a node in the cluster crashes, you can replace the box and only have to restart that one node, not the entire cluster.
- The database is currently the single point of failure. There's no easy way to cluster it to provide a redundant service, so if the database fails, the whole system goes down. This means that backup strategies for the database server are vital. We must be able to restore the database quickly. I need to discuss this more thoroughly with Denise and the DBAs so I understand the options available to provide the quickest backup recovery.
- You still need to reboot clusters after applying service packs. They recommend setting up a routine maintenance window.
- Regarding application pack 1 and Oracle 10g: if you have already installed CE 6 on Oracle 9i, you are stuck with Oracle 9i. You have to install each service pack (1, 2, and 3), then the application pack. If you want to upgrade to Oracle 10g, you have to do a completely new install. This is useful for us to consider when moving to production. We may want to reinstall our test server, or we may wait for a migration path to Oracle 10g. Must discuss with Denise.
- One possible option for a test server is to install all the pieces of a cluster on one test server. Saves us from having to purchase a bunch of hardware, but also reduces redundancy if we need to take a server to fix production, because we would be taking our entire test system.
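To make the sizing figures in the notes above concrete for my own planning, here is a rough sketch of the arithmetic in Python. It assumes the 500-concurrent-users-per-application-server figure quoted in the session and the five-box minimum cluster layout (two application nodes, one administrative server, one load balancer, one database server). The function names and the 1,200-user example are my own placeholders, not anything WebCT provides, and real numbers should come from the analysis of our 4.1 system.

```python
import math

# Figure quoted in the session; treat it as a rule of thumb, not a guarantee.
USERS_PER_APP_SERVER = 500


def app_servers_needed(concurrent_users, per_server=USERS_PER_APP_SERVER):
    """Estimate how many application servers a given load would need."""
    if concurrent_users < 0:
        raise ValueError("concurrent_users must not be negative")
    # Always at least one application server, even for a tiny load.
    return max(1, math.ceil(concurrent_users / per_server))


def cluster_boxes(concurrent_users):
    """Estimate the box count for a clustered setup.

    A cluster needs at least two application nodes, plus one
    administrative server, one load balancer, and one database server.
    """
    nodes = max(2, app_servers_needed(concurrent_users))
    return {
        "application": nodes,
        "admin": 1,
        "load_balancer": 1,
        "database": 1,
        "total": nodes + 3,
    }


if __name__ == "__main__":
    # Hypothetical load; replace with the measured figure from our 4.1 system.
    print(cluster_boxes(1200))
```

Because the application server uses about the same memory regardless of box size, the estimate scales by adding nodes rather than by buying larger machines. The same count would need to be repeated for a matching test environment.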
Posted by kvl014 at July 12, 2006 09:50 PM