Uninterrupted service provision is an important goal of every Data Centre.
It is difficult to achieve simply because we cannot totally prevent service failures from happening. It is therefore crucial to go one step further and ask: if a service fails, how fast can it be resumed?
From this academic year, when you use the Unix server or one of the Intranet services, you may not notice any difference in the backend server that provides them. However, if this server fails, which we hope will not happen, the services will be resumed more rapidly than before. The enhancement has been made in the backend server and, for most services, is completely transparent to end-users.
Recently, another SUN Starfire server was acquired, which allows more system domains to be set up. As mentioned in the Dec. 1997 issue of Network Computing, a notable feature of the Starfire is that it can be divided into one or more system domains, each of which can be viewed as a separate entity or "machine" with its own processors, memory, and storage. The Starfire already has many built-in reliability, availability, and serviceability (RAS) features intended to keep service downtime to a minimum. Despite all these, a hardware or software failure can still disrupt services. Once a failure occurs, what matters is how fast the services can be resumed. This is where the concept of clustered services comes into play.
SUN Cluster Concept
All of the RAS features pertain only to a single machine. If the availability of services depends solely on the uptime of this machine, all of the services become unavailable when a single fault defeats the RAS features and crashes the system. If the services can still be provided by another machine when the first one fails, the single-node failure problem is overcome. This is the main purpose of setting up a clustered service. By clustering, two or more machines, or nodes, are joined together to provide service load balancing and redundancy. In general, a cluster of nodes provides the following benefits: continuous availability of services, tolerance of software and hardware crashes, automatic detection of and recovery from machine failure, and on-line serviceability (for example, taking a node off-line for repair).
The clustering features provided by SUN servers include redundancy and load balancing, as well as service failover (the High Availability, or HA, cluster).
Unix Cluster on Campus
By using the appropriate software and hardware, and by joining one of the domains
in the first Starfire server with a domain of equal capacity in the second
Starfire, a Unix HA cluster can be set up. Under this configuration, most of the
backend services that support the Academic Unix environment and the Intranet services
can be clustered in such a way that when one of the domains fails, the other domain
takes over all of these services. With the HA cluster, the downtime may be reduced from
hours to a few minutes or even a few seconds. The services clustered in this
way include the Mail service, the DNS service, the NFS service, the License Manager
service, the Oracle Database service, the Oracle Web service, and the Netscape Web
service.
The Unix HA cluster consists of two Starfire system domains housed in
separate Starfire servers. Each domain consists of 8 UltraSparc processors,
1 GB of main memory, and ample disk storage. The disk drives, controllers,
and paths are mirrored for redundancy. Each domain is also protected
by the RAS features of its own Starfire. The two domains are
joined together using 2 private network interfaces and cables forming a
“heartbeat” environment. Using the heartbeat, each domain will probe the
other at a pre-defined interval. If the other domain does not respond
within that period, it is assumed to have failed and the failover process
will be initiated. The running domain will take over all services that were
originally offered by the failed domain. Users of stateless
connections, such as the NFS service, will see only a slight
delay and will not lose any work in progress as a result of the
failover. Users of stateful connections, such as the telnet service,
will have to log in again on the surviving domain once the takeover is
complete. Each domain is initially configured to provide different services
so that neither node sits idle. This also has the effect of balancing the
service load between the two domains.
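The heartbeat and takeover logic described above can be illustrated with a small simulation. This is only a sketch and not how the Sun cluster software is implemented: the class names, the timeout value, and the assignment of services to each domain are assumptions made for illustration, and the probe over the private network links is replaced by a simple flag.

```python
from dataclasses import dataclass, field


@dataclass
class Domain:
    """One cluster node (a Starfire system domain in the article)."""
    name: str
    services: set = field(default_factory=set)
    alive: bool = True

    def respond_to_probe(self) -> bool:
        # A real node would answer over the private heartbeat links;
        # here the reply is simulated with a flag.
        return self.alive


class HeartbeatMonitor:
    """Runs on one domain and watches its peer (hypothetical design)."""

    def __init__(self, local: Domain, peer: Domain, timeout: float):
        self.local = local
        self.peer = peer
        self.timeout = timeout      # assumed value, in seconds
        self.last_reply = 0.0       # time of the peer's last answer
        self.failed_over = False

    def probe(self, now: float) -> None:
        """Call once per probe interval with the current time."""
        if self.failed_over:
            return
        if self.peer.respond_to_probe():
            self.last_reply = now
        elif now - self.last_reply >= self.timeout:
            self.take_over()

    def take_over(self) -> None:
        # The surviving domain adopts every service of the failed one.
        self.local.services |= self.peer.services
        self.peer.services = set()
        self.failed_over = True


# Each domain starts with different services so that neither sits idle.
# Which service runs where is an assumption, not taken from the article.
domain_a = Domain("domain-a", {"mail", "dns", "nfs"})
domain_b = Domain("domain-b", {"oracle-db", "web"})
monitor = HeartbeatMonitor(local=domain_a, peer=domain_b, timeout=10.0)

monitor.probe(now=0.0)      # peer answers normally
domain_b.alive = False      # simulate a crash of the second domain
monitor.probe(now=5.0)      # silent, but still within the timeout
monitor.probe(now=10.0)     # timeout reached, failover is initiated
print(sorted(domain_a.services))
```

After the last probe, the surviving domain holds the services of both domains, which mirrors the takeover described in the text. In a real cluster each domain would run such a monitor against the other, over both private links.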
Over the summer, and as of the time of printing, the CSC has already
started to migrate the services provided by the existing Intranet
servers to this new HA cluster. The clustered service is expected
to be put into service at the start of the new school year.
(Reference: White papers on Sun Enterprise Clusters from Sun Microsystems
http://www.sun.com/clusters/wp.html).