Starfire Cluster - Central Intranet Server

John Chan

Uninterrupted service provision is an important goal of every Data Centre. It is difficult to achieve simply because we cannot totally prevent service failures from happening. It is therefore crucial to go one step further and ask: if a service fails, how fast can it be resumed?

Starting from this academic year, when you use the Unix server or one of the Intranet services, you may not notice any difference in the backend server that provides them. However, if this server fails, which we hope will not happen, the services will be resumed more rapidly than before. The enhancement has been made in the backend server and is completely transparent to end-users for most services.

Recently, another SUN Starfire server has been acquired, which will allow more system domains to be set up. As mentioned in the Dec. 1997 issue of Network Computing, the Starfire has the spectacular feature of supporting one or more system domains, each of which can be viewed as a separate entity or "machine" with its own processors, memory, and storage space. The Starfire itself already has many reliability, availability, and serviceability (RAS) features built in to reduce service downtime to the minimum. Despite all these, there is still the possibility of a hardware or software failure disrupting the services. Once a failure occurs, it is a matter of how fast the services can be resumed. It is here that the concept of clustered services comes into play.

SUN Cluster Concept

All of the RAS features pertain only to a single machine. If the availability of services depends solely on the uptime of this machine, all of the services will become unavailable if a single fault defeats all the RAS features and crashes the system. If the services can still be provided by another machine when this machine fails, the single node failure problem can be overcome. This is the main purpose of setting up a clustered service. By clustering, two or more machines or nodes are joined together so as to provide service load balancing and redundancy. In general, a cluster of nodes provides the following benefits: continuous availability of services, tolerance of software and hardware crashes, automatic machine failure detection and recovery, and on-line serviceability (for example, taking a node off-line for repair).

The clustering features provided by SUN servers include redundancy and load balancing capabilities, and service failover capabilities (High Availability or HA cluster).

Unix Cluster on Campus

By using the appropriate software and hardware, and by joining one of the domains in the first Starfire server with a domain of equal capacity in the second Starfire, a Unix HA cluster can be set up. Under this configuration, most of the backend services that support the Academic Unix environment and the Intranet services can be clustered in such a way that when one of the domains fails, the other domain will take over all of these services. With an HA cluster, the downtime may be reduced from hours to a few minutes or even a few seconds. The services that are clustered in this way include the Mail service, the DNS service, the NFS service, the License Manager service, the Oracle Database service, the Oracle Web service, and the Netscape Web service.

The Unix HA cluster consists of two Starfire system domains housed in separate Starfire servers. Each domain consists of 8 UltraSparc processors, 1 GB of main memory, and ample disk storage. The disk drives, controllers, and paths are mirrored for redundancy, and each domain is further protected by its own RAS features. The two domains are joined together using 2 private network interfaces and cables, forming a “heartbeat” environment. Using the heartbeat, each domain probes the other at a pre-defined interval. If the other domain does not respond within that interval, it is assumed to have failed and the failover process is initiated. The running domain then takes over all services that were originally offered by the failed domain.

Users of stateless connections such as the NFS service will see only a slight delay in application service and will not lose any work in progress as a result of the failover. Users of stateful connections such as the telnet service will have to log back into the recovery domain once the takeover is complete. Each domain is initially configured to provide different services so as to avoid leaving a node idle. This also has the effect of balancing the load of all services between the two domains.
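The heartbeat and takeover logic described above can be illustrated with a minimal Python sketch. It is not the Sun cluster software: the names `Domain`, `HeartbeatMonitor`, and `fail_over`, the 5-second timeout, and the assignment of services to each domain are all assumptions made up for illustration. Time is passed in explicitly so that the example runs without a real network or clock.

```python
from dataclasses import dataclass, field


@dataclass
class Domain:
    """One cluster node (a system domain, in the article's terms)."""
    name: str
    services: set = field(default_factory=set)


class HeartbeatMonitor:
    """Remembers when the peer domain last answered a heartbeat probe."""

    def __init__(self, timeout_seconds, now):
        self.timeout_seconds = timeout_seconds  # hypothetical pre-defined interval
        self.last_reply = now

    def record_reply(self, now):
        """Call whenever the peer answers a probe."""
        self.last_reply = now

    def peer_failed(self, now):
        """The peer is presumed failed once it has been silent longer than the timeout."""
        return (now - self.last_reply) > self.timeout_seconds


def fail_over(survivor, failed):
    """Move every service from the failed domain to the surviving one."""
    moved = sorted(failed.services)
    survivor.services |= failed.services
    failed.services = set()
    return moved


def demo():
    # The split of services between the domains is invented for this example.
    a = Domain("domain-a", {"mail", "dns"})
    b = Domain("domain-b", {"nfs", "oracle-db"})
    monitor = HeartbeatMonitor(timeout_seconds=5, now=0)

    monitor.record_reply(now=3)            # domain-b answers at t=3
    assert not monitor.peer_failed(now=6)  # only 3 seconds of silence so far

    if monitor.peer_failed(now=10):        # 7 seconds of silence exceeds the timeout
        fail_over(survivor=a, failed=b)
    return a, b
```

After `demo()` runs, `domain-a` holds all four services and `domain-b` holds none, which mirrors the running domain taking over the services of the failed one. A real cluster would also have to guard against both domains believing the other has failed; that is outside the scope of this sketch.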

During the summer, and at the time of printing, the CSC has already started to migrate the services provided by the existing Intranet servers to this new HA cluster. It is expected that the clustered service can be put into service at the start of the new school year. (Reference: White papers on Sun Enterprise Clusters from Sun Microsystems, http://www.sun.com/clusters/wp.html).

[Issue No. 16]

Computing Services Centre
City University of Hong Kong
ccnetcom@cityu.edu.hk