System-Level Fault Tolerance (Clustering/Network Load Balancing)

Windows Server 2003 provides several methods of improving system- or server-level fault tolerance by using a few of the services included in the Enterprise and Datacenter platforms. This chapter covers system-level fault tolerance using Windows Server 2003 network load balancing (NLB) and the Microsoft Cluster Service (MSCS). (From Microsoft Windows Server 2003 Unleashed, second edition, by Rand Morimoto, et al. Sams Publishing, 2004, ISBN: 0672326671.)

September 22, 2004

In many of today's business environments, using computer applications and networking services has become critical in conducting day-to-day business functions efficiently. The word downtime has become taboo in situations in which an unstable application or a failed server can greatly impact employee productivity or cost organizations money. Deploying fault-tolerant servers to provide reliable access to critical applications, user data, and networking services is required when unexpected downtime is unacceptable.

Windows Server 2003 provides several methods of improving system- or server-level fault tolerance by using a few of the services included in the Enterprise and Datacenter platforms. Chapter 30, "File System Fault Tolerance (DFS)," discussed file-level fault tolerance, including the Distributed File System (DFS) and volume shadow copies. This chapter covers system-level fault tolerance using Windows Server 2003 network load balancing (NLB) and the Microsoft Cluster Service (MSCS). These built-in clustering technologies provide load-balancing and failover capabilities that can be used to increase fault tolerance for many different types of applications and network services. Each of these clustering technologies is different in many ways. Choosing the correct type of clustering depends on the applications and services that will be hosted on the cluster.

Windows Server 2003 technologies such as NLB and MSCS improve fault tolerance for applications and network services, but before these technologies can be leveraged effectively, basic server stability best practices must be put in place.

This chapter focuses on the policies and procedures needed to create an environment that supports a fault-tolerant network. Additionally, this chapter contains the step-by-step procedures needed to make server hardware more reliable through the successful implementation of NLB and MSCS.

Building Fault-Tolerant Systems

Building fault-tolerant computing systems consists of carefully planning and configuring server hardware and software, network devices, and power sources. Purchasing quality server and network hardware is a good start to building a fault-tolerant system, but the proper configuration of this hardware is equally important. Also, providing this equipment with stable line power that is backed up by a battery or generator adds fault tolerance to the network. Last but not least, proper tuning of server operating systems helps enhance availability of network services such as file shares, print servers, network applications, and authentication servers.

Using Uninterruptible Power Supplies

Connecting line power to server and network devices through uninterruptible power supplies (UPSs) not only provides conditioned incoming power by removing voltage spikes and providing steady line voltage levels, but it also provides battery backup power. When line power fails, the UPS switches to battery mode, which should provide ample time to shut down the server or network device without risk of damaging hardware or corrupting data. UPS manufacturers commonly provide software that can send network notifications, run scripts, or even gracefully shut down servers when power thresholds are met. One final word on power is that most computer and network hardware manufacturers provide device configurations that incorporate redundant power supplies designed to keep the system powered up in the event of a single power supply failure.

During power outages, many system administrators find out which critical devices are not connected to a UPS, and the race begins to shut down and shift power from non-critical devices. To avoid these situations, administrators need to perform regular inspections of critical hardware devices in server rooms and network closets to ensure that all necessary servers, network routers, switches, hubs, and firewalls are backed by battery power. This matters because when power to a server fails, the battery may provide only a few minutes for users to save data and close connections; the network must remain available during that window to reduce the chance of data corruption.

This chapter is from Microsoft Windows Server 2003 Unleashed, by Rand Morimoto, et al. (Sams Publishing, 2004, ISBN: 0672326671). Check it out at your favorite bookstore today.

Choosing Networking Hardware for Fault Tolerance

Network design can also incorporate fault tolerance by creating redundant network routes and by utilizing technologies that can group devices together for the purposes of load balancing and device failover. Load balancing is the process of spreading requests across multiple devices to keep individual device load at an acceptable level. Failover is the process of moving services offered on one device to another upon device failure, to maintain availability.
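The two definitions above can be illustrated with a minimal Python sketch. It does not represent how any particular switch, teaming driver, or Windows component works; the DevicePool class and its method names are hypothetical and exist only for this illustration. Requests are handed out in round-robin order (load balancing), and a device marked as failed is skipped so that its share of requests moves to the remaining devices (failover).

```python
class DevicePool:
    """Round-robin dispatcher that skips devices marked as failed."""

    def __init__(self, devices):
        self.devices = list(devices)
        self.healthy = set(self.devices)
        self._next = 0  # index of the next device to try

    def mark_failed(self, device):
        # Failover: the device stops receiving requests.
        self.healthy.discard(device)

    def mark_recovered(self, device):
        if device in self.devices:
            self.healthy.add(device)

    def pick(self):
        """Return the next healthy device in round-robin order."""
        for _ in range(len(self.devices)):
            device = self.devices[self._next]
            self._next = (self._next + 1) % len(self.devices)
            if device in self.healthy:
                return device
        raise RuntimeError("no healthy devices available")


if __name__ == "__main__":
    pool = DevicePool(["nic1", "nic2", "nic3"])
    print([pool.pick() for _ in range(4)])
    pool.mark_failed("nic2")
    print([pool.pick() for _ in range(3)])
```

Round robin is only one possible balancing policy; real devices may weight requests by load or link speed.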

Networking hardware such as Ethernet switches, routers, and network cards can be configured to provide fault-tolerant services through load-balancing applications or through features within the network device firmware or operating system. Refer to the manufacturer's documentation to research fault-tolerant configurations available in your organization's network devices.

For more robust redundant network card configurations, third-party hardware vendors have created network card teaming and network card fault-tolerant software applications. These technologies allow client/server communication to fail over from one network interface card (NIC) to another in the event of an NIC failure. Also, they can be configured to balance network requests across all the NICs in one server simultaneously. Refer to the particular hardware manufacturer's documentation to find out whether a compatible teaming application is available for your network card.

Note: Windows Server 2003 network load balancing does not allow multiple NICs on the same server to participate in the same NLB cluster.

Selecting Server Storage for Redundancy

Server disk storage usually contains user data and/or operating system files that make it a critical server subsystem that should incorporate fault tolerance. There are a few different ways to create fault-tolerant disk storage for the Windows Server 2003 operating system. The first is creating Redundant Arrays of Inexpensive Disks (RAID) using disk controller configuration utilities, and the second is creating the RAID disks using dynamic disk configuration from within the Windows Server 2003 operating system.

Using two or more disks, different RAID-level arrays can be configured to provide fault tolerance that can withstand disk failures and still provide uninterrupted disk access. Implementing hardware-level RAID configured and stored on the disk controller is preferred over the software-level RAID configurable within Windows Server 2003 Disk Management because the disk management and synchronization processes in hardware-level RAID are offloaded to the RAID controller. With these processes offloaded to the RAID controller, the operating system will perform better overall.

Another good reason to provide hardware-level RAID is that the configuration of the disks does not depend on the operating system, which gives administrators greater flexibility when it comes to recovering server systems and performing upgrades. Refer to Chapter 22, "Windows Server 2003 Management and Maintenance Practices," for more information on ways to create RAID arrays using Windows Server 2003 Disk Management. Also, refer to the manufacturer's documentation on creating RAID arrays on your RAID disk controller.

Improving Application Reliability

An application's reliability is greatly dependent on the software code and the hardware it is running on. Administrators can make legacy client/server applications more reliable on Windows Server 2003 by running them in lower application compatibility modes, which isolate each application instance in a separate memory space. If one instance crashes, the remaining instances and the server itself remain available and unaffected. Reliability for client/server-based applications written for Windows Server 2003 can be improved by deploying these applications on clusters. Windows Server 2003 Enterprise and Datacenter servers provide two different clustering technologies that enhance application reliability by providing server load balancing and failover capabilities.

Examining Windows Server 2003 Clustering Technologies

Windows Server 2003 provides two clustering technologies, which are included on the Enterprise and Datacenter server platforms. Clustering is the grouping of independent server nodes that are accessed and viewed on the network as a single system. When an application is run from a cluster, the end user can connect to a single cluster node to perform his work, or each request can be handled by multiple nodes in the cluster. In cases where data is read-only, the client may request data and receive the information from all the nodes in the cluster, improving overall performance and response time.

The first clustering technology Windows Server 2003 provides is Cluster Service, also known as Microsoft Cluster Service (MSCS). The Cluster Service provides system fault tolerance through a process called failover. When a system fails or is unable to respond to client requests, the clustered services are taken offline and moved from the failed server to another available server, where they are brought online and begin responding to existing and new connections and requests. Cluster Service is best used to provide fault tolerance for file, print, enterprise messaging, and database servers.

The second Windows Server 2003 clustering technology is network load balancing (NLB) and is best suited to provide fault tolerance for front-end Web applications and Web sites, Terminal servers, VPN servers, and streaming media servers. NLB provides fault tolerance by having each server in the cluster individually run the network services or applications, removing any single points of failure. Certain applications—for example, Terminal Services—require a client to connect to the same server during the entire session, while clients viewing Web sites can request pages from any node in the cluster during a visit. Configuring how client/server communication is divided and balanced across the servers is dependent on the application's needs.
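The difference between session-bound applications and stateless Web requests can be sketched in Python. This is an illustration of the general idea of hash-based client affinity, not NLB's actual algorithm; the function name pick_node and its parameters are assumptions made for this example. When only the client IP address is hashed, every connection from that client lands on the same node (suitable for something like a Terminal Services session). When the source port is included, connections from one client can spread across nodes (acceptable for stateless page requests).

```python
import hashlib


def pick_node(client_ip, nodes, client_port=None):
    """Map a client to one node of the cluster.

    client_port=None  -> sticky: same IP always maps to the same node.
    client_port given -> each connection (IP + port) is hashed separately.
    """
    key = client_ip if client_port is None else f"{client_ip}:{client_port}"
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    # Use the first 8 bytes of the hash as an integer and map it to a node.
    index = int.from_bytes(digest[:8], "big") % len(nodes)
    return nodes[index]


if __name__ == "__main__":
    cluster = ["web1", "web2", "web3"]
    # Sticky: repeated calls for one client return the same node.
    print(pick_node("192.168.1.20", cluster))
    print(pick_node("192.168.1.20", cluster))
    # Not sticky: different source ports may map to different nodes.
    print({pick_node("192.168.1.20", cluster, port) for port in range(5000, 5020)})
```

Note that a simple modulo mapping like this reshuffles most clients whenever the node list changes; it is kept here only because it is the shortest way to show the affinity idea.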

Note: Microsoft does not support running both MSCS and NLB on the same computer due to potential hardware sharing conflicts between the two technologies.

Reviewing Cluster Terminology

Before you can design and implement MSCS and NLB clusters, you must understand certain clustering terminology. The following list describes key terms associated with Windows Server 2003 clustering:

  • Cluster—A cluster is a group of independent servers that are accessed and viewed on the network as a single system.

  • Node—A node is an independent server that is a member of a cluster.

  • Cluster resource—A cluster resource is a network application or service defined and managed by the cluster application. Some examples of cluster resources are network names, IP addresses, logical disks, and file shares.

  • Cluster resource group—Cluster resources are contained within a cluster in a logical set called a cluster resource group, commonly referred to as a cluster group. Cluster groups are the units of failover within the cluster. When a cluster resource fails and cannot be restarted automatically, the entire cluster group is taken offline and failed over to another available cluster node.

  • Cluster virtual server—A cluster virtual server is a cluster resource group that contains a network name and IP address resource. Virtual server resources are accessed either by the domain name system (DNS) or NetBIOS name resolution or directly from the IP address. The name and IP address remain the same regardless of which cluster node the virtual server is running on.

  • Cluster heartbeat—The cluster heartbeat is the communication maintained between individual cluster nodes to determine node status. Typically, heartbeat latency between nodes must not exceed 500 milliseconds, or the nodes may believe that there is a failure and commence cluster group failovers.

  • Cluster quorum disk—The cluster quorum disk maintains the definitive cluster configuration data. MSCS uses a quorum disk or disks and requires continuous access to the cluster configuration data contained within it. The quorum contains configuration data defining which server nodes actively participate in the cluster, what applications and services are defined in the cluster, and the current states of the resources and the individual nodes. This data is used to determine whether a particular resource group or groups need to be failed to an available cluster node in the event of a failure on an active node. If a cluster node loses access to the quorum, the Cluster Service will fail on that node. In a typical MSCS cluster, the quorum resource is located on a shared storage device.

  • Local quorum resource—Like the quorum resource, the local quorum contains the cluster configuration data. Unlike the standard quorum device that is usually housed on a shared disk, the local quorum is kept on a node's local disk. The local quorum resource was created for single-node cluster configurations, commonly used for cluster application development and testing.

  • Majority Node Set (MNS) resource—The MNS resource is the quorum resource used for a Majority Node Set cluster. The MNS resource maintains consistent configuration data across all the nodes in the cluster. If the MNS quorum is lost, it can be recovered by "forcing the quorum" on a remaining cluster node. Refer to the Windows Server 2003 online help and look for the topic "Forcing the Quorum in a Majority Node Set Cluster."

  • Generic cluster resource—Generic cluster resources were created to define cluster-unaware applications within a cluster group. This gives the ability to fail the resource over to another node in the cluster when the active node fails. This resource is not monitored by the cluster application; therefore, application failure does not result in a restart or failover scenario. Generic cluster resources include the generic application, generic script, and generic service resources. For more information on these resources, refer to the Windows Server 2003 Help and Support tool and search for "generic cluster resources."

  • Cluster-aware application—A cluster-aware application provides a mechanism by which the Cluster Service can test the application availability to determine whether it is functioning as desired. When a cluster-aware application fails, the cluster can stop and restart the application as necessary on the same node and, if necessary, move it to another available node where it can be restarted.

  • Cluster-unaware application—A cluster-unaware application can run on a cluster, but the application itself is not monitored by the Cluster Service. This means that the cluster can fail over the application only in the event that another resource fails in the cluster group. If the application stops responding, the cluster is not aware and therefore cannot restart it. Keep in mind that there are other ways to manage cluster-unaware applications outside the cluster, and in some cases these approaches may be the only option. For more information on how to install and configure generic applications, refer to the Windows Server 2003 Help and Support and search for "generic application resource type."

  • Failover—Failover is the process of a cluster group moving from the current active node to another available node in the cluster. Failover occurs when a server becomes unavailable or when a resource in the cluster group fails and cannot recover within the failure threshold.

  • Failback—Failback is the process of a cluster group moving back to a preferred node after the preferred node resumes cluster membership. Failback must be configured within a cluster group for this to happen. The cluster group must have a preferred node defined and a failback threshold configured. A preferred node is the node you would like your cluster group to run on during regular cluster operation. When a group is failing back, the cluster is performing the same failover operation but is triggered by a server rejoining or resuming cluster operation instead of by a server or resource failure.
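The failover and failback definitions in the list above can be modeled with a small Python sketch. It is a toy model under stated assumptions, not the Cluster Service implementation: the ClusterGroup and Cluster classes, their fields, and the rule of moving a group to the first remaining online node are all hypothetical. It shows that failback is the same move operation as failover, triggered by a node rejoining, and that it happens only for groups that have failback enabled.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ClusterGroup:
    name: str
    preferred_node: str
    failback: bool = False        # failback must be configured explicitly
    owner: Optional[str] = None   # node currently running the group


class Cluster:
    def __init__(self, nodes):
        self.online = list(nodes)
        self.groups = []

    def add_group(self, group):
        group.owner = group.preferred_node
        self.groups.append(group)
        return group

    def node_failed(self, node):
        """Failover: move every group owned by the failed node."""
        self.online.remove(node)
        for group in self.groups:
            if group.owner == node:
                group.owner = self.online[0] if self.online else None

    def node_rejoined(self, node):
        """Failback: move a group home only if it is configured to do so."""
        self.online.append(node)
        for group in self.groups:
            if group.owner is None:
                group.owner = node  # nothing else was available
            elif group.failback and group.preferred_node == node:
                group.owner = node


if __name__ == "__main__":
    cluster = Cluster(["NODE1", "NODE2"])
    mail = cluster.add_group(ClusterGroup("mail", "NODE1", failback=True))
    files = cluster.add_group(ClusterGroup("files", "NODE1"))
    cluster.node_failed("NODE1")
    print(mail.owner, files.owner)
    cluster.node_rejoined("NODE1")
    print(mail.owner, files.owner)
```

In this model the group without failback stays where it landed, which avoids a second interruption when the preferred node returns.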


Note: Plan carefully when considering failback. For more information, refer to the "Configuring Failover and Failback" section later in this chapter.


Active and Passive Clustering Modes

Active/Passive Clustering Mode

Active/passive clustering occurs when one node in the cluster provides clustered services while the other available node or nodes remain online but do not provide services or applications to end users. When the active node fails, the cluster groups previously running on that node are failed over to the passive node, causing the node's participation in the cluster to go from passive to active state to begin servicing client requests.

This configuration is usually implemented with database servers that provide access to data that is stored in only one location and is too large to replicate throughout the day. One advantage of Active/Passive mode is that if each node in the cluster has similar hardware specifications, there is no performance loss when a failover occurs. The only real disadvantage of this mode is that the passive node's hardware resources cannot be leveraged during regular daily cluster operation.


Note: Active/passive configurations are a great choice for keeping cluster administration and maintenance as low as possible. For example, the passive node can be used to test updates and other patches without directly impacting production. However, it is nonetheless important to test in an isolated lab environment or, at a minimum, after hours or during predefined maintenance windows.


Active/Active Clustering Mode

Active/active clustering occurs when one instance of an application runs on each node of the cluster. When a failure occurs, two or more instances of the application can run on one cluster node. The advantage of Active/Active mode over Active/Passive mode is that the physical hardware resources on each node are used simultaneously. The major disadvantage of this configuration is that if each node is already running at or near 100% capacity, then in the event of a node failure the remaining active node must assume the failed node's entire load in addition to its own, greatly reducing performance. As a result, it is critical to monitor server resources at all times and ensure that each node has enough resources to take over the other node's responsibilities if the other node fails.
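The capacity concern can be made concrete with a little arithmetic, sketched here in Python. The sketch assumes identical nodes and that a failed node's load is spread evenly across the survivors; both are simplifications, and the function names are hypothetical. With N nodes each running at a fraction u of capacity, losing one node leaves each survivor at u × N / (N − 1), so a two-node active/active cluster should keep each node at or below 50% to absorb a failover without overload.

```python
def load_after_failure(per_node_load, node_count, failed=1):
    """Per-node load (as a fraction of capacity) after `failed` nodes drop out.

    Assumes identical nodes and an even redistribution of the lost load.
    """
    survivors = node_count - failed
    if survivors <= 0:
        raise ValueError("no surviving nodes to carry the load")
    return per_node_load * node_count / survivors


def max_safe_load(node_count, failed=1):
    """Highest per-node load that still fits after `failed` nodes are lost."""
    if node_count - failed <= 0:
        raise ValueError("no surviving nodes to carry the load")
    return (node_count - failed) / node_count


if __name__ == "__main__":
    # Two nodes at full capacity: the survivor would need twice its capacity.
    print(load_after_failure(1.0, 2))
    # Two nodes at 50%: the survivor ends up exactly at capacity.
    print(load_after_failure(0.5, 2))
    # Safe per-node ceilings for two-node and four-node clusters.
    print(max_safe_load(2), max_safe_load(4))
```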

Choosing the Right Clustering Technology

For these fault-tolerant clustering technologies to be most effective, administrators must carefully choose which technology and configuration best fits their application or network service needs. NLB is best suited to provide connectivity to TCP/IP-based services such as Terminal Services, Web sites, VPN services, and streaming media services. NLB provides scalability, and the amount of redundancy it provides depends on the number of systems in the NLB cluster. The Windows Server 2003 Cluster Service provides server failover functionality for mission-critical applications such as enterprise messaging, databases, and file and print services.

Although Microsoft does not support using both NLB and MSCS on the same server, multi-tiered applications can take advantage of both technologies by using NLB to load-balance front-end application servers and using MSCS to provide failover capabilities to back-end databases that contain data too large to replicate during the day.

Microsoft Cluster Service

Microsoft Cluster Service (MSCS) is a clustering technology that provides system-level fault tolerance by using a process called failover. Cluster Service is used best to provide access to resources such as file shares, print queues, email or database services, and back-end applications. Applications and network services defined and managed by the cluster, along with cluster hardware including shared disk storage and network cards, are called cluster resources. Cluster Service monitors these resources to ensure proper operation.

When a problem is encountered with a cluster resource, Cluster Service attempts to fix the problem before failing it completely. The cluster node running the failing resource attempts to restart the resource on the same node first. If the resource cannot be restarted, the cluster will fail the resource, take the cluster group offline, and move it to another available node, where it can then be restarted.
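The restart-then-fail-over sequence described above can be sketched in Python. This is a simplified model, not Cluster Service code: the function name, the restart callback, and the default of three restart attempts are assumptions for illustration (the chapter refers to a failure threshold but does not give a value here).

```python
def handle_resource_failure(restart, current_node, other_nodes, max_restarts=3):
    """Try to recover a failed resource, locally first, then elsewhere.

    restart(node) is a caller-supplied function returning True when the
    resource comes online on that node. max_restarts is a hypothetical
    threshold. Returns the node now hosting the resource, or None.
    """
    # Step 1: attempt to restart on the node where the resource failed.
    for _ in range(max_restarts):
        if restart(current_node):
            return current_node

    # Step 2: the local restarts failed, so move to another available node.
    for node in other_nodes:
        if restart(node):
            return node

    # No node could bring the resource online.
    return None


if __name__ == "__main__":
    attempts = []

    def restart_only_works_on_node2(node):
        attempts.append(node)
        return node == "NODE2"

    print(handle_resource_failure(restart_only_works_on_node2, "NODE1", ["NODE2"]))
    print(attempts)
```

In the real service the whole cluster group moves with the failed resource; the sketch tracks a single resource to keep the control flow visible.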

Several conditions can cause a cluster group to fail over. Failover can occur when an active node in the cluster loses power or network connectivity or suffers a hardware failure. Also, when a cluster resource cannot remain available on an active node, the resource's group is moved to an available node, where it can be started. In most cases, the failover process is either noticed by the clients as a short disruption of service or no disruption at all.

To avoid unwanted failover, power management should be disabled on each of the cluster nodes in the motherboard BIOS, on the network interface cards, and in the Power applet in the operating system's Control Panel. Power settings that allow a monitor to shut off are okay, but the administrator must make sure that the disks are configured to never go into standby mode.

Cluster nodes can monitor the status of resources running on their local system, and they can also keep track of other nodes in the cluster through private network communication messages called heartbeats. The heartbeats are used to determine the status of a node and send updates of cluster configuration changes to the cluster quorum resource.

The quorum resource contains the cluster configuration data necessary to restore a cluster to a working state. Each node in the cluster needs to have access to the quorum resource; otherwise, it will not be able to participate in the cluster. Windows Server 2003 provides three types of quorum resources, one for each cluster configuration model.

Using Network Load Balancing

The second clustering technology provided with the Windows Server 2003 Enterprise and Datacenter server platforms is network load balancing. NLB clusters provide high network performance and availability by balancing client requests across several servers. When client load increases, NLB clusters can easily be scaled out by adding more nodes to the cluster to maintain or provide better response time to client requests.

Two great features of network load balancing are that no proprietary hardware is needed, and an NLB cluster can be configured and up and running literally in minutes. NLB clusters can grow to 32 nodes, and if larger cluster farms are necessary, DNS round robin or a third-party solution should be investigated to meet this larger demand.

One important point to remember is that within NLB clusters, each server's configuration must be updated independently. The NLB administrator is responsible for making sure that application configuration and data are kept consistent across each node. Applications such as Microsoft's Application Center can be used to manage content and configuration data among those servers participating in the NLB cluster. To install network load balancing, proceed directly to the "Installing Network Load Balancing Clusters" section later in this chapter.

Implementing Cluster Service

After an organization decides to cluster an application or service using Cluster Service, it must then decide which cluster configuration model best suits its needs.

MSCS can be deployed in three different configuration models that will accommodate most deployment scenarios and requirements. The three configuration models are the single-quorum device cluster, the single-node cluster, and the Majority Node Set cluster. The typical and most common cluster deployments are configured using the single-quorum device cluster.

The Single-Quorum Device Cluster

The single-quorum device cluster configuration model is composed of two or more server nodes that are all connected to a shared storage device. In this model, only one copy of the quorum data is maintained and is housed on the shared storage device, as shown in Figure 31.1. All cluster nodes have access to the quorum data, but the quorum disk resource runs only on one node of the cluster at a time.

Figure 31.1 Two-node single-quorum device cluster.

This configuration model is best suited for applications and services that provide access to large amounts of mission-critical data and require high availability. When the cluster encounters a problem on a cluster group containing a shared storage disk resource, the cluster group is failed over to the next node. When the cluster group is back online, all the data is once again available after only a short disruption in service. Typical services deployed using this cluster configuration model include file, messaging, and database servers.

The Single-Node Cluster

The single-node cluster configuration model was created to serve many purposes. First, a single-node cluster can run solely on local disks, but it can also use shared storage. When creating a single-quorum cluster, the administrator must first create a single-node cluster but with a shared disk quorum. The single-node cluster can also use the local quorum resource, which is usually located on internal disk storage. The local quorum resource is a great benefit for cluster application development because only a single server with internal disk storage is needed to test cluster applications.

One last point to add about this model is that because there is only one node, the cluster will not use or provide failover. If the single node is down, all the cluster groups are unavailable.

The Majority Node Set Cluster

The Majority Node Set (MNS) cluster is the third configuration model and represents the future of clustering, as shown in Figure 31.2. MNS can use shared storage devices, but this capability is not a requirement. In an MNS cluster, each node maintains a local copy of the quorum device data in a specific Majority Node Set resource. Windows Server 2003 Enterprise supports up to four nodes per cluster, and Datacenter supports up to eight nodes. Because each node maintains a local copy of the quorum and a shared storage device is not necessary, MNS clusters can be deployed across a WAN in a geographically distributed environment. Windows Server 2003 supports up to two separate sites for MNS, and because the cluster IP address will need to fail over across sites, the sites must be either bridged or connected through a virtual private network (VPN). Another viable option is having Network Address Translation (NAT) installed and configured for failover for proper IP recovery to occur. The latency between the cluster nodes for private communication must not exceed 500 milliseconds; otherwise, the cluster can go into a failed state.

Figure 31.2 Two-site, four-node Majority Node Set cluster.

An MNS cluster will remain up and running as long as the majority of the nodes in the cluster are available. In other words, to remain operational, more than half of the nodes must be up and running. For instance, in a four-node cluster, three nodes must remain available, or the cluster will fail. If an administrator configures a three-node cluster, two nodes must remain up and running. Both the three-node and four-node clusters can tolerate only a single node failure.
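The majority rule reduces to simple integer arithmetic, shown in the Python sketch below. The function names are hypothetical; the rule itself (more than half of the nodes must be running) comes from the paragraph above.

```python
def nodes_required(total_nodes):
    """Smallest number of running nodes that is more than half of the total."""
    return total_nodes // 2 + 1


def failures_tolerated(total_nodes):
    """How many nodes can fail before the majority is lost."""
    return total_nodes - nodes_required(total_nodes)


def has_majority(total_nodes, running_nodes):
    return running_nodes >= nodes_required(total_nodes)


if __name__ == "__main__":
    for total in (3, 4, 5):
        print(total, nodes_required(total), failures_tolerated(total))
```

Running the loop shows why a fourth node adds no extra failure tolerance over three: both tolerate one failure, and a second tolerated failure arrives only at five nodes.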

If you are considering or requiring availability provided by MNS, it is recommended to always purchase at least one additional node when planning for an MNS cluster. This node can be used in the lab for application testing, including testing patches and application updates, or it can be configured in a cold-standby state that can be added to a cluster when a single node fails.

An MNS Cluster Scenario

An MNS cluster model supports geographically distributed clusters. This means that in a three-node cluster deployment, you can deploy two nodes in Site A and one node in Site B. A spare server will be kept at Site B to join the cluster if necessary. When a single node fails in Site A, the cluster remains up and running because the majority of the nodes are still running, even though they are running in separate sites. If the node in Site B fails, the cluster will remain running on the two nodes at Site A. If a major disaster or power outage is encountered at Site A, the cluster will fail because only one node is running at Site B. To bring the cluster back online, you can restore one of Site A's nodes at the Site B location using the spare server. This gives you the two nodes you need to make the three-node MNS cluster operational.

In the same scenario, if you deploy a four-node cluster with two nodes at each site, a single site failure will result in the cluster failing and require an additional server to restore a third and required node. So, if you want to properly plan for a site outage using a four-node MNS cluster, you would need to have a spare server in each location, making the total six servers for a four-node cluster.
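The two site layouts discussed above can be checked with a short Python sketch. It applies only the majority rule from this chapter; the function names and the dictionary layout of site names to node counts are assumptions for illustration, and it ignores everything else that affects a real recovery (network links, data replication, forcing the quorum).

```python
def survives_site_loss(nodes_per_site):
    """For each site, report whether the cluster keeps a majority if it is lost."""
    total = sum(nodes_per_site.values())
    needed = total // 2 + 1  # more than half of all configured nodes
    return {
        site: (total - count) >= needed
        for site, count in nodes_per_site.items()
    }


def spares_needed(nodes_per_site, lost_site):
    """Spare servers required at the surviving site to regain a majority."""
    total = sum(nodes_per_site.values())
    needed = total // 2 + 1
    remaining = total - nodes_per_site[lost_site]
    return max(0, needed - remaining)


if __name__ == "__main__":
    three_node = {"Site A": 2, "Site B": 1}
    four_node = {"Site A": 2, "Site B": 2}
    print(survives_site_loss(three_node))
    print(survives_site_loss(four_node))
    print(spares_needed(three_node, "Site A"))
    print(spares_needed(four_node, "Site A"), spares_needed(four_node, "Site B"))
```

For the four-node layout, one spare is needed whichever site is lost, which is where the total of six servers comes from.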

MNS is a great choice for geographically distributed clusters, but you must follow these rules to deploy the clusters properly:

  • The cluster nodes require less than a 500-millisecond response time between the private LAN adapters on each of the cluster nodes.

  • A VPN must be established between the sites to allow the clustered IP address to fail over across site boundaries while remaining accessible to clients. If the sites' LANs are bridged across a WAN, this would also suffice. Also consider having redundant connections between those sites.

  • MNS can be deployed across only two sites.

  • Data other than the cluster quorum information does not automatically replicate between cluster nodes and needs to be replicated with software or replicated manually.

MNS clusters represent the future of clustering, and several developments will be made along the way to simplify installations and deployment. Microsoft recommends that MNS clusters be deployed only on hardware supported by the server and storage device vendors for use with geographically distributed MNS clusters.

Choosing Applications for Cluster Service

Many applications can run on Cluster Service, but it is important to choose those applications wisely. Even if an application can run on MSCS, it might not be optimized for clustering. Work with the vendor to determine requirements, functionality, and limitations (if any). To benefit from and adapt to running on a cluster, an application should also meet the following criteria:

  • Because clustering is IP-based, the cluster application or applications must use an IP-based protocol.

  • Applications that require access to local databases must have the option of configuring where the data can be stored.

    Some applications need to have access to data regardless of which cluster node they are running on. With these types of applications, it is recommended that the data is stored on a shared disk resource that will fail over with the cluster group. If an application will run and store data only on the local system or boot drive, the Majority Node Set cluster configuration, along with a separate file replication mechanism, should be considered.

  • Client sessions must be able to re-establish connectivity if the application encounters a network disruption.

    During the failover process, there is no client connectivity until an application is brought back online. If the client software does not try to reconnect and simply times out when a network connection is broken, this application may not be the best one to cluster.
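The reconnect requirement in the last criterion can be illustrated with a small client-side retry loop. This is a minimal sketch of the general pattern, not code from any particular clustered application; `call_with_reconnect`, its parameters, and the default retry values are all hypothetical.

```python
import time


def call_with_reconnect(operation, attempts=5, delay_seconds=2.0, sleep=time.sleep):
    """Run operation(), retrying on connection errors until attempts run out."""
    last_error = None
    for attempt in range(1, attempts + 1):
        try:
            return operation()
        except ConnectionError as error:
            last_error = error
            if attempt < attempts:
                # Wait before retrying so the cluster group has time
                # to come back online on another node.
                sleep(delay_seconds)
    raise last_error
```

A client built this way rides through the brief outage of a failover, whereas a client that gives up on the first broken connection surfaces every failover to the user as an error.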

Those cluster-aware applications meeting all the preceding criteria are usually the best applications to deploy in a cluster configuration. Many services built into Windows Server 2003 can be clustered and will fail over efficiently and properly. If a particular application is not cluster-aware, be sure to investigate all the implications of the application deployment on the Cluster server.


Note - If you're purchasing a third-party software package for MSCS, be sure that both Microsoft and the software manufacturer certify that it will work on a Windows Server 2003 cluster; otherwise, support will be limited when troubleshooting is necessary.


This chapter is from Microsoft Windows Server 2003 Unleashed, by Rand Morimoto, et al. (Sams Publishing, 2004, ISBN: 0672326671). Check it out at your favorite bookstore today.


Shared Storage Devices

Shared disk storage was a requirement for all previous releases of MSCS until Windows Server 2003. Now only the traditional design of a single quorum device cluster has such a requirement, but a shared storage device can be a part of any cluster configuration.

In the past, storage area networks (SANs) were used to satisfy the shared storage device requirement. The logical volumes created in the SAN device must be configured and recognized as basic disks by the Windows Server 2003 operating system. Windows Server 2003 identifies the logical volumes on the SAN by their disk signatures, and each volume is treated as a separate disk by MSCS. Currently, dynamic disks are not supported for shared disk volumes. SCSI SAN units are supported on two-node clusters, but for clusters with more than two nodes, fiber channel is the preferred method of connecting cluster nodes to the shared storage.

Using a single fiber channel, Windows Server 2003 can access both shared and nonshared disks residing on a SAN. This allows both the shared storage and operating system volumes to be located on the SAN, giving administrators the flexibility of deploying diskless servers. Of course, the SAN must support this option, and the boot drives must be assigned exclusive access for individual cluster nodes through proper disk zoning and masking. Consult SAN vendor documentation and check the Cluster HCL on the Microsoft Web site to find approved SAN devices.

The Cluster server uses a shared nothing architecture, which means that each cluster resource can be running on only one node in the cluster at a time. When a disk resource is failed over between nodes, the SAN device must be reset to accommodate the mounting of the disk on the remaining node. If the SAN device is used by more than just cluster nodes, SAN communication can be disrupted to other servers if the SAN is not configured to reset only the targeted logical unit number (LUN) as opposed to resetting the entire bus. Windows Server 2003 supports targeted LUN resets, and SAN vendor documentation should be reviewed to ensure proper zoning and masking of the SAN device.

Multipath I/O

Windows Server 2003 supports multipath I/O to external storage devices such as SANs. This allows for multiple redundant paths to external storage, adding yet another level of fault tolerance. This capability is now achieved through redundant fiber channel controller cards in each cluster node.

Volume Shadow Copy for Shared Storage Volume

The Volume Shadow Copy (VSS) service is supported on shared storage volumes. Volume Shadow Copy can take a point-in-time snapshot of an entire volume, enabling administrators and users to recover data from a previous version. The amount of disk space used for each copy can be minimal, so enabling the service can add data fault tolerance and reduce the recovery time of a file or folder. Before implementation, test Volume Shadow Copy thoroughly on any disk containing enterprise databases such as Microsoft SQL Server 2000 to ensure that it provides the required fault tolerance and recoverability and that databases do not suffer corruption as a result of a rollback to a previous version of the database file.

Single-Quorum Cluster Scalability

The single-quorum cluster is composed of independent server nodes that all connect to a shared storage device such as a SAN. Table 31.1 specifies the minimum and maximum number of nodes and types of storage communications allowed in a single-quorum cluster.

Table 31.1 Number of Nodes Allowed in a Cluster

Operating System | Number of Nodes | Allowed Cluster Storage Device
Windows Server 2003 Enterprise Server | 2 to 8 | SCSI, fiber channel (recommended for clusters with more than two nodes)
Windows Server 2003 Datacenter Edition | 2 to 8 | SCSI, fiber channel (recommended for clusters with more than two nodes)
64-bit edition of Windows Server 2003 Enterprise Server | 2 to 8 | Fiber channel
64-bit edition of Windows Server 2003 Datacenter Edition | 2 to 8 | Fiber channel



Installing Cluster Service

The Windows Server 2003 Cluster Service is installed by default. Because the service is already installed, creating a cluster does not require the installation media or a reboot. The Cluster Administrator utility can be used to create a new cluster and to manage existing clusters on local and remote nodes.

Both the GUI-based Cluster Administrator and the command-line utility Cluster.exe can be used to create and manage clusters. Both tools can effectively manage a cluster, but Cluster.exe allows an administrator to create an unattended, scripted cluster installation. Cluster.exe provides too many arguments and switches to be discussed in detail here, so refer to Help and Support from the Start menu and search for "cluster.exe." Alternatively, at a command prompt, type cluster.exe /?. Later in this chapter, in the "Installing the First Node in the Cluster" section, basic Cluster.exe commands will be outlined.

A recommendation for cluster nodes is to have multiple network cards in each node so that one card can be dedicated to internal cluster communication (private network) while the other can be used only for client connectivity (public network) or for both public and private communication (mixed network). Cluster nodes equipped with only one network card must run the card in Mixed Network mode.

During a cluster installation, if shared storage is discovered, Cluster Service will default to installing the quorum resource on the smallest basic partition on the device. If no shared storage is available, a local or an MNS quorum will be created.
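The default quorum selection described above can be modeled as a simple filter followed by a minimum. The sketch below is only an illustration of that rule under stated assumptions: the `Partition` record and its fields are hypothetical, and the real installer inspects the storage itself rather than a list like this.

```python
from dataclasses import dataclass


@dataclass
class Partition:
    name: str
    size_mb: int
    is_basic: bool   # dynamic disks are not eligible in this model
    is_shared: bool  # True if the partition is on the shared storage device


def default_quorum(partitions):
    """Return the name of the smallest shared basic partition, or None."""
    candidates = [p for p in partitions if p.is_shared and p.is_basic]
    if not candidates:
        return None  # no shared storage: a local or MNS quorum applies instead
    return min(candidates, key=lambda p: p.size_mb).name
```

Because the choice is driven purely by size, a small partition intended for another purpose could be picked as the quorum, so it is worth confirming the selected disk during installation.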

Working Through the Cluster Pre-Installation Checklist

Be sure to check the following before installing Cluster Service:

  1. Gather the network name for the cluster.

  2. Gather all necessary IP addresses for the cluster and for each network card in the cluster node.

  3. Before booting up the first server, connect, configure, and turn on all external storage devices if any are being used. You should also have the appropriate drivers that may be required for this external storage device.

  4. If multiple network cards are being used, rename the connections using easily identifiable names, such as Cluster Private Nic and Cluster Mix Nic, similar to what is shown in Figure 31.3.

    Figure 31.3 Multiple network adapter configuration.

  5. Create a Cluster Service account in the domain in which you are installing the cluster. It needs to be only a standard user account, but the password should never expire. During the cluster installation, the account will be given Local Administrator rights on the cluster nodes and will be given a few rights in the domain, such as Add Computer Accounts to the Domain.

  6. Choose your cluster configuration mode and choose the correlating quorum type during the cluster installation.


Installing the First Node in the Cluster

When a cluster is built, the first system to be built is considered the first node in the cluster. This system needs to be initially prepared as the primary system. When the primary system has been configured, additional nodes can be added to the cluster.

To install the first node in the cluster, follow these steps:

  1. Shut down both the cluster nodes and shared storage devices.

  2. Connect cables as required between the cluster nodes and shared storage devices.

  3. Connect each node's NICs to a network switch or hub using appropriate network cables.

  4. If a shared storage device is being used, power on the shared storage device and wait for the startup sequence to complete.

  5. Start the first node in the cluster. If a shared SCSI disk will be used, configure the SCSI adapter card's ID on each cluster node to a different number. For example, use ID 6 for node 1 and ID 7 for node 2.

  6. Log on with an account that has Local Administrator privileges.

  7. If the server is not a member of a domain, add the server to the correct domain and reboot as necessary.

  8. Configure each network card in the node with the correct network IP address information.

    Network cards that will be used only for private communication should have only an IP address and subnet mask configured. Default Gateway, DNS, NetBIOS-related services (such as Client for Microsoft Networks), and WINS should not be configured. Also, uncheck the Register This Connection's Address in DNS box, as shown in Figure 31.4, on the DNS tab of the Advanced TCP/IP Settings page.

    Figure 31.4 TCP/IP DNS configuration settings.

    For network cards that will support public or mixed networks, configure all TCP/IP settings as they would normally be configured.

  9. If you're not already logged in, log on to the server using an account that has Local Administrator privileges.

  10. Click Start, Administrative Tools, Cluster Administrator, as shown in Figure 31.5.

    Figure 31.5 Launching the Cluster Administrator utility.

  11. When the Cluster Administrator opens, choose the Create New Cluster action and click OK.

  12. Click Next on the New Server Cluster Wizard Welcome screen to continue.

  13. Choose the correct domain from the Domain pull-down menu.

  14. Type the cluster name in the Cluster Name text box and click Next to continue.

  15. Type the name of the cluster node and click Next to continue. The wizard defaults to the local server, but clusters can be configured remotely. The cluster analyzer analyzes the node for functionality and cluster requirements, as shown in Figure 31.6. A detailed log containing any errors or warnings that can stop or limit the installation of the Cluster server is generated.

  16. Review the log and make changes as necessary; then click Re-analyze or click Next to continue.

    Figure 31.6 Cluster analyzer utility operations.

  17. Enter the cluster IP address and click Next.

  18. Enter the Cluster Service account name and password and choose the correct domain. Click Next to continue.


    Note - The Cluster Service account needs to be only a regular domain user, but specifying this account as the Cluster Service gives this account Local Administrator privileges on the cluster node and also delegates a few user rights, including the ability to act as a part of the operating system and add computers to the domain.


  19. On the Proposed Cluster Configuration page, review the configuration and choose the correct quorum type by clicking the Quorum button, as shown in Figure 31.7.

    Figure 31.7 Choosing the cluster quorum configuration.

    • To create an MNS cluster, click the Quorum button on the Proposed Cluster Configuration page, choose Majority Node Set, and click OK.

    • If a SAN is connected to the cluster node, the Cluster Administrator will automatically choose the smallest basic NTFS volume on the shared storage device. Make sure the correct disk has been chosen and click OK.

    • If you're configuring a single-node cluster with no shared storage, choose the Local Quorum resource and click OK.

  20. Click Next to complete the cluster installation.

  21. After the cluster is created, click Next and then Finish to close the New Server Cluster Wizard and return to the Cluster Administrator.

Alternatively, you can create a cluster by using Cluster.exe. The following example creates a cluster called cluster1 on the server named Server1. It uses a Cluster Service account called clustersvc@companyabc.com, the 192.168.100.10 IP address with a class C subnet mask, and a network connection that has been renamed Cluster Mix Nic. At a command prompt, type the following as a single line:

Cluster.exe /CLUSTER:cluster1 /CREATE /NODE:server1 /USER:clustersvc@companyabc.com /PASSWORD:password /IPADDRESS:192.168.100.10,255.255.255.0,"Cluster Mix Nic"

Then press Enter to create the cluster.
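For scripted, unattended installations, the command line can be assembled from variables instead of being typed by hand. The following is a minimal sketch that only formats the string; it uses just the switches shown in the command above, does not run Cluster.exe, and the helper name `build_create_command` is hypothetical.

```python
def build_create_command(cluster, node, user, password, ip, mask, network):
    """Format a Cluster.exe /CREATE command line from its parts."""
    # Quote the network connection name only if it contains spaces.
    network_arg = f'"{network}"' if " " in network else network
    return (
        f"Cluster.exe /CLUSTER:{cluster} /CREATE /NODE:{node} "
        f"/USER:{user} /PASSWORD:{password} "
        f"/IPADDRESS:{ip},{mask},{network_arg}"
    )


print(build_create_command(
    "cluster1", "server1", "clustersvc@companyabc.com", "password",
    "192.168.100.10", "255.255.255.0", "Cluster Mix Nic",
))
```

Keeping the password out of the script itself (for example, by prompting for it at run time) is a sensible precaution, because the finished command line contains it in clear text.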


Adding Additional Nodes to a Cluster

A cluster in Windows Server 2003 Enterprise Edition can support up to eight nodes. After the first server is installed in a cluster, additional nodes can be added to the cluster.

To add more nodes to a cluster, do the following:

  1. Log on to the desired cluster node using an account that has Local Administrator privileges.

  2. Click Start, Administrative Tools, Cluster Administrator.

  3. When the Cluster Administrator opens, choose Add Nodes to a Cluster and type the name of the cluster in the Cluster Name text box. Click OK to continue.

  4. When the Add Nodes Wizard appears, click Next to continue.

  5. Type in the server name of the next node and click Add.

  6. Repeat the preceding steps until you've entered all the additional nodes you want in the Selected Computer text box. Click Next to continue. The cluster analyzer will then analyze the additional nodes for functionality and cluster requirements.

  7. Review the log and make changes as necessary; then click Re-analyze or click Next to continue.

  8. Enter the Cluster Service account password and click Next to continue.

  9. Review the configuration on the Proposed Cluster Configuration page and click Next to configure the cluster.

  10. After the cluster is configured, click Next and then click Finish to complete adding additional nodes to the cluster.

  11. Select File, Close to exit the Cluster Administrator.

Managing Clusters

To manage a cluster effectively, an administrator must be familiar with managing cluster groups and resources using one or more cluster management applications. Microsoft provided two cluster management applications for Cluster Service: one GUI-based and one command line–based.

Cluster Administrator

The Cluster Administrator, shown in Figure 31.8, gives an administrator a GUI-based tool for managing clusters. This tool can be used to manage local and remote clusters, including tasks such as creating new clusters, adding nodes to existing clusters, and creating cluster resource groups or resources. This tool can also be used to remove (evict) nodes from a cluster and perform manual failovers of cluster groups.

Figure 31.8 Sample Cluster Administrator tool screen.

The Cluster.exe Utility

Cluster.exe is a command-line utility that can be used to manage a local or remote cluster from a command line or a shell. This tool can be used to access a cluster when the GUI-based Cluster Administrator will not open. Additionally, this tool can be used in a script to remotely deploy or change cluster configurations.

Cluster Automation Server

The Cluster Automation server provides a mechanism for software developers and Independent Software Vendors (ISVs) to create custom cluster-management applications to enhance or provide administration of clusters. The Cluster Automation server provides a set of Component Object Model (COM) objects to allow developers to create scripts to automate the management of their clusters.

Configuring Failover and Failback

Clusters that contain two or more nodes automatically have failover configured for each defined cluster group when the second node and following nodes join the cluster. By manually adding additional nodes to existing cluster groups, the administrator can add failover functionality to every node in the cluster on a group-by-group basis. Failback is never configured by default and needs to be manually configured for each cluster group if desired. Failback allows a designated preferred server to always run a particular cluster group when it is available.


Cluster Group Failover Configuration

To create a failover and failback process, the cluster group failover configuration needs to be set up properly. Follow these steps to configure cluster group failover:

  1. Click Start, Administrative Tools, Cluster Administrator.

  2. When the Cluster Administrator opens, choose Open Connection to Cluster and type the name of the cluster in the Cluster Name text box. Click OK to continue. If the local machine is part of the cluster, enter . (period) as the cluster name, and the program will connect to the cluster running on the local machine.

  3. Right-click the appropriate cluster group and select Properties.

  4. Select the Failover tab and set the maximum number of failovers allowed during a predefined period of time. When the number of failovers is exceeded within the Period interval, shown as a threshold of 10 in Figure 31.9, Cluster Service will change the group to a failed state.

    Figure 31.9 Setting failover thresholds for the cluster group.

  5. Click OK to save the failover configuration and close the group's Properties page.

  6. Select File, Close to exit Cluster Administrator.
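The threshold behavior described in step 4 can be modeled as a count of failovers inside a sliding time window. This is a simplified sketch, not the Cluster Service implementation: `group_state` is a hypothetical helper, times are plain hour values, and it assumes the group is marked failed once the number of failovers within the period exceeds the threshold.

```python
def group_state(failover_times_hours, threshold, period_hours, now_hours):
    """Return 'failed' if more than `threshold` failovers fall inside the period."""
    window_start = now_hours - period_hours
    recent = [t for t in failover_times_hours if window_start < t <= now_hours]
    return "failed" if len(recent) > threshold else "online"


# Eleven failovers in the last six hours against a threshold of 10.
failovers = [0.5 * i for i in range(1, 12)]
print(group_state(failovers, threshold=10, period_hours=6, now_hours=6))
```

A low threshold stops a group from bouncing between nodes indefinitely when the underlying fault follows it, at the cost of leaving the group offline until an administrator intervenes.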

Cluster Group Failback Configuration

The cluster group failback process involves making configuration changes in the Cluster Administrator utility. Follow these steps to configure cluster group failback:

  1. Click Start, Administrative Tools, Cluster Administrator.

  2. When Cluster Administrator opens, choose Open Connection to Cluster and type the name of the cluster in the Cluster Name text box. Click OK to continue.

  3. Right-click the appropriate cluster resource group and select Properties.

  4. On the General tab, click the Modify button to select the preferred owners. Double-click the node or nodes you prefer the cluster group to run on and click OK to return to the cluster group's General tab.

  5. Select the Failback tab, choose the Allow Failback radio button, and set time options for allowing failback.

  6. Click OK to save the failback configuration and close the group's Properties page.

  7. Select File, Close to exit Cluster Administrator.


Note - To reduce the chance of having a group failing back to a node during regular business hours after a failure, configure the failback schedule to allow failback only during nonpeak times or after hours using settings similar to those made in Figure 31.10.
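The effect of an after-hours failback schedule like the one in the preceding note can be sketched as a check on the hour of day. This is an illustrative model under stated assumptions: `failback_allowed` is a hypothetical helper, hours run from 0 to 23, the end hour is treated as exclusive, and a window whose start is later than its end is taken to wrap past midnight.

```python
def failback_allowed(hour, window_start, window_end):
    """True if `hour` (0-23) falls inside the failback window [start, end)."""
    if window_start <= window_end:
        return window_start <= hour < window_end
    # The window wraps past midnight, for example 22 to 4.
    return hour >= window_start or hour < window_end


# A window from 22:00 to 04:00 keeps failback out of business hours.
print(failback_allowed(23, 22, 4))  # True
print(failback_allowed(12, 22, 4))  # False
```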



Testing Clusters

After all the desired cluster nodes are added and failover and failback are configured for each cluster group, the cluster installation is complete and it is time to test cluster functionality. For these tests to be complete, failover and, when applicable, failback of cluster groups need to be tested. They can be tested by manual failover and also by taking a cluster node off the network by unplugging network cards. However, do not test the cluster by disconnecting shared storage device connections because doing so can corrupt the data on the shared storage.


Note - Clusdiag.msi, located in the Windows Server 2003 Resource Kit, can be used to diagnose and test the cluster. It can also aid in troubleshooting failures by providing administrators reports based on prior testing.


Figure 31.10 Setting failback for a cluster file group.

Testing Cluster Group Manual Failover

To test the cluster group failover manually, follow these steps:

  1. Open Cluster Administrator, right-click the desired cluster group, and choose Take Offline.

  2. Right-click the same cluster group and choose Move Group. If the cluster contains more than two nodes, choose the node to which you want to move the group.

  3. Right-click the same cluster group and choose Bring Online.

  4. The group now should start on the node you chose in step 2. Repeat steps 1–3 for each cluster group, moving back and forth between all available cluster nodes.

  5. When testing is complete, move cluster groups to their desired cluster nodes and bring all groups online.

Initiating Failure of a Cluster Resource

To simulate a cluster resource failure, a cluster administrator can initiate a resource failure using the Cluster Administrator utility. This utility can be used to verify how a failing cluster resource will affect the cluster group.

To test the failure of a cluster resource, follow these steps:

  1. Open Cluster Administrator.

  2. Right-click the cluster resource you will manually fail and select Properties.

  3. Select the Advanced tab and note how many failures this resource will tolerate before it finally fails completely or fails the entire cluster group.

  4. Close the resource's property page.

  5. Right-click the cluster resource you will manually fail and choose Initiate Failure.

  6. Repeat the preceding steps as necessary to ensure proper operation during resource failure conditions.

  7. When testing is complete, move cluster groups to their desired cluster nodes and bring all groups online.

Initiating Cluster Node Network Failure

To simulate and verify how cluster groups will fail over during a cluster node network or network card failure, perform the following steps:

  1. Log on to the desired cluster node with Cluster Administrator or Local Administrator permissions.

  2. Click Start, Control Panel.

  3. Double-click the Network Connections applet.

  4. Right-click each of the cluster node's private network and public network adapters and choose Disable.

  5. On an available cluster node, log in using a Cluster Administrator account.

  6. Click Start, Administrative Tools, Cluster Administrator.

  7. If the Cluster Administrator does not connect to the cluster or connects to a different cluster, choose File, Open Connection.

  8. From the Action drop-down box, choose Open Connection to Cluster. Then, in the Cluster or Server Name drop-down box, type . (period) and click OK to connect.

  9. Verify that the network-disabled node appears as offline and that all cluster groups have failed over to other available cluster nodes.

  10. When testing is complete, enable all disabled network cards on the network-disabled node.

  11. Move cluster groups to their desired cluster nodes and bring all the groups online.


Maintaining Cluster Nodes

Applications are clustered due to the critical part they play in a business. Even though the highest availability and fault tolerance are needed, each cluster node will, at one point or another, require maintenance for hardware or software upgrades. To prepare a cluster node for maintenance, a few preliminary and post steps need to occur.

Pre-Maintenance Tasks

Before maintenance is run on a cluster node, several tasks need to be completed. To prepare a cluster node for maintenance, do the following:

  1. Whether you're planning a software or hardware upgrade, research to see whether the changes will be supported on a cluster node.

  2. Log on to a cluster node that will remain online using an account that has Administrative permissions on the cluster.

  3. Click Start, Administrative Tools, Cluster Administrator.

  4. If Cluster Administrator does not open to the correct cluster or does not open a cluster, pull down the Cluster Server menu and choose to connect to an existing cluster. Then enter the cluster's fully qualified domain name and click OK.

  5. Find the server that will be going offline for maintenance and double-click it.

  6. Double-click Active Groups.

  7. If there are any active groups, when appropriate (after hours or during a change control session) right-click each active group and choose Move Group. If there are more than two nodes in the cluster, choose a node that will remain online during the maintenance.

  8. Repeat step 7 for each remaining active group.

  9. In the right pane, right-click the appropriate node and choose Pause Node.

  10. Close Cluster Administrator.

Perform necessary maintenance, including any reboots if necessary. Check to see that all updates have been applied successfully and the server hardware and software are running as expected. When all checks are completed, you are ready to make this node available in the cluster.
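On clusters with several groups and nodes, it can help to plan the group moves from step 7 before the maintenance window opens. The sketch below is a hypothetical planning aid, not a Cluster Service feature: `plan_moves` simply spreads the groups from the node being serviced across the nodes that stay online, in round-robin order, and the group and node names are placeholders.

```python
def plan_moves(active_groups, maintenance_node, nodes):
    """Map each group on the maintenance node to a node that stays online."""
    targets = [n for n in nodes if n != maintenance_node]
    if not targets:
        raise ValueError("no remaining node can host the groups")
    return {
        group: targets[i % len(targets)]
        for i, group in enumerate(active_groups)
    }


plan = plan_moves(
    ["Cluster Group", "File Group", "SQL Group"],
    maintenance_node="NODE1",
    nodes=["NODE1", "NODE2", "NODE3"],
)
for group, target in plan.items():
    print(f"Move {group} to {target}")
```

A real plan would also account for each group's possible owners and the capacity of the remaining nodes, which this sketch ignores.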

Post-Maintenance Tasks

After maintenance has been conducted on a cluster node, several tasks need to be completed before the node returns to service. To perform these post-maintenance tasks, do the following:

  1. Log on to a cluster node that has remained online using an account that has Administrative permissions on the cluster.

  2. Click Start, Administrative Tools, Cluster Administrator.

  3. If Cluster Administrator does not open to the correct cluster or does not open to any cluster, pull down the Cluster Server menu and choose Connect to an Existing Cluster. Then enter the cluster's fully qualified domain name and click OK.

  4. Find the server that is paused for maintenance, right-click it, and choose Resume Node.

  5. In the right pane, double-click the cluster name at the top of the window.

  6. Double-click Groups.

  7. In the left pane, right-click a cluster group that you want running back in the updated node and choose Move Group. (If there are more than two nodes in the cluster, choose the upgraded node.)

  8. Repeat step 7 for any additional cluster groups you want running on the upgraded node. When finished selecting the cluster groups, click OK to execute.


Creating Additional Cluster Groups and Resources

The Cluster server supports multiple cluster groups that can be used to support several purposes. For instance, a cluster group can be created to consolidate a standalone file server to a virtual server running on the cluster or to run as a separate cluster application group. Also, some applications like Microsoft SQL Server 2000 may require separate cluster groups to operate efficiently. When additional cluster groups are necessary, they can be easily created using the Cluster Administrator program.

Creating Groups

To create new cluster groups, perform the following steps:

  1. Click Start, Administrative Tools, Cluster Administrator.

  2. When Cluster Administrator opens, choose Open Connection to Cluster and type the name of the cluster in the Cluster Name text box. Click OK to continue.

  3. Right-click the cluster and select New and then Group, as shown in Figure 31.11.

  4. Enter the appropriate information to complete the group addition.

  5. Click Next and then Finish after all groups have been created.

  6. Select File, Close to exit Cluster Administrator.

Creating New Resources

To create new resources, follow these steps:

  1. Click Start, Administrative Tools, Cluster Administrator.

  2. When Cluster Administrator opens, choose Open Connection to Cluster and type the name of the cluster in the Cluster Name text box. Click OK to continue.

  3. Right-click the cluster and select New and then Resource.

  4. Type in the appropriate name and description for the resource.

  5. Choose the correct resource type and which cluster group it will reside in.

    Figure 31.11 Adding a new group for cluster configuration.

  6. Choose which servers can run the resource and click Next to continue.

  7. Choose which existing resources the new resource will depend on and click Next to continue.

  8. Enter any remaining resource parameters to complete the resource creation. Certain resources depend on other resources. For instance, a network name resource depends on an IP address resource, so an IP address resource must be configured in a cluster group before a network name resource can be created there.

  9. In Cluster Administrator, right-click the new resource and choose Bring Online.

  10. Select File, Close to exit Cluster Administrator.
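The dependency rule in step 8 amounts to an ordering problem: a resource can come online only after everything it depends on. The following sketch shows one way to compute such an order with Python's standard `graphlib` module (Python 3.9 or later). The resource names and the `online_order` helper are illustrative assumptions, not output from a real cluster.

```python
from graphlib import TopologicalSorter


def online_order(dependencies):
    """Return resource names ordered so each follows everything it depends on."""
    return list(TopologicalSorter(dependencies).static_order())


# Each key depends on the resources in its set.
dependencies = {
    "Network Name": {"IP Address"},
    "File Share": {"Network Name", "Physical Disk"},
}
print(online_order(dependencies))
```

In this example the IP address always precedes the network name, and the file share comes last because it needs both the network name and the disk.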

Changing the Cluster Service Account Password

In previous versions of Cluster Service, changing the Cluster Service account password required bringing the Cluster Service down on each node and manually changing the cluster password using the Change Password applet. Then the Cluster Service logon credentials had to be changed in the Services applet in the Control Panel.

Starting with Windows Server 2003, the Cluster Service account password can be changed with the cluster online. Do not, however, change the password using the Active Directory Users and Computers snap-in or the Windows security box if logged in with that account. Instead, run the Cluster.exe command-line utility from a server on the network. At a command prompt, enter the following command to complete the password-changing operation:

Cluster.exe /cluster:clustername /changepass:newpassword,currentpassword

Then press Enter to continue.


Note - All nodes in the cluster must be running on the Windows Server 2003 operating system for this password-changing command to work.


Moving Cluster Groups

Moving a cluster group from one node to another makes the resources unavailable during the time necessary to take the group offline and bring it online on the next node.

If the administrator moves a group for the purposes of performing maintenance on a node, she must be sure to pause the node after all cluster groups are moved off. This ensures that no cluster groups will move to this node until the administrator resumes node operation after maintenance is performed.

If you want to move a group, right-click the cluster group and select Move Group. If more than two nodes are possible owners of this cluster group, choose the appropriate node to move this group to.
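
The maintenance sequence described above (move the groups, then pause the node, and resume it afterward) can also be scripted. The sketch below only builds the command lines; the `group ... /moveto:` and `node ... /pause` and `/resume` forms are not shown in this chapter and are given here as the Cluster.exe syntax is commonly documented, so verify them with `cluster /?` before use. The cluster, node, and group names are hypothetical.

```python
def maintenance_commands(cluster, node, groups, target):
    """Build Cluster.exe argument lists that move groups off `node`, then pause it."""
    base = ["cluster.exe", f"/cluster:{cluster}"]
    cmds = [base + ["group", group, f"/moveto:{target}"] for group in groups]
    # Pause last, so no group can move back while maintenance is under way.
    cmds.append(base + ["node", node, "/pause"])
    return cmds

def resume_command(cluster, node):
    """Build the command that resumes the node after maintenance."""
    return ["cluster.exe", f"/cluster:{cluster}", "node", node, "/resume"]

for cmd in maintenance_commands("cluster1", "SERVER1", ["Cluster Group"], "SERVER2"):
    print(" ".join(cmd))
```

Each list can be passed to `subprocess.run` on a machine where Cluster.exe is installed.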

This chapter is from Microsoft Windows Server 2003 Unleashed, by Rand Morimoto, et al. (Sams Publishing, 2004, ISBN: 0672326671). Check it out at your favorite bookstore today.

Removing a Node from a Cluster

Cluster nodes can be removed from a cluster for a number of reasons, and this process can be accomplished quite quickly.


Note - If you're removing nodes on an MNS cluster, be sure that a majority of the nodes remain running to keep the cluster in a working state.
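
The majority rule in the note is simple arithmetic: more than half of the configured nodes must stay running. A minimal sketch follows; the function name is hypothetical.

```python
def mns_has_majority(configured_nodes, running_nodes):
    """Return True if the running nodes form a majority of the configured nodes."""
    needed = configured_nodes // 2 + 1  # strictly more than half
    return running_nodes >= needed

# A three-node MNS cluster tolerates one node being down, but not two.
print(mns_has_majority(3, 2))
print(mns_has_majority(3, 1))
```

Note that an even node count adds no extra tolerance: a four-node MNS cluster still needs three running nodes.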


To remove a node from a cluster, follow these steps:

  1. Click Start, Administrative Tools, Cluster Administrator.

  2. When the Cluster Administrator opens, choose Open Connection to Cluster and type the server name of a node in the cluster that will remain up and running during this process.

  3. Double-click the node that will be removed from the cluster and click Active Groups.

  4. If any groups are running on the node, at the appropriate time move these groups to other available nodes.

  5. Right-click the cluster node and choose Stop Cluster Service.

  6. Right-click the cluster node and choose Evict Node, as shown in Figure 31.12.

  7. Confirm the eviction process by choosing Yes, and the node will be removed from the cluster immediately.


    Figure 31.12
    Evicting a node.

  8. Alternatively, instead of using Evict Node in Cluster Administrator (steps 6 and 7), run the following command from a command line to evict the node:

    Cluster.exe /cluster:clustername node nodename /evict

    Then press Enter to evict the node.

  9. Select File, Close to exit Cluster Administrator.
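
The command-line eviction can be wrapped in a script. The sketch below builds the same Cluster.exe command shown in the steps and, by default, only returns it as text so that it can be reviewed first. The function names and the `dry_run` parameter are hypothetical.

```python
import subprocess

def evict_command(cluster, node):
    """Build the Cluster.exe argument list that evicts `node` from `cluster`."""
    return ["Cluster.exe", f"/cluster:{cluster}", "node", node, "/evict"]

def evict_node(cluster, node, dry_run=True):
    """Return the command line; run it only when dry_run is False."""
    cmd = evict_command(cluster, node)
    if not dry_run:
        # Raises CalledProcessError if Cluster.exe reports a failure.
        subprocess.run(cmd, check=True)
    return " ".join(cmd)

print(evict_node("cluster1", "SERVER2"))
```

As in the manual procedure, move any active groups off the node and stop its Cluster Service before running the command for real.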

Backing Up and Restoring Clusters

To successfully back up and restore the entire cluster or a single cluster node, the cluster administrator must first understand how to troubleshoot, back up, and restore a standalone Windows Server 2003 system. The process of backing up cluster nodes is the same as for a standalone server, but restoring a cluster may require additional steps or configurations that do not apply to a standalone server. Detailed Windows Server 2003 backup and restore techniques and disaster recovery planning best practices are discussed in Chapter 32, "Backing Up a Windows Server 2003 Environment," and Chapter 33, "Recovering from a Disaster." This section focuses mainly on backing up and restoring cluster nodes.

To be prepared to recover different types of cluster failures, you must take the following steps:

  1. For all clusters (single-node, MNS, and single-quorum clusters), do the following:

    • Back up each cluster node's local disks.

    • Back up each cluster node's system state.

    • Back up the cluster quorum from any node running in the cluster.

    • Back up each cluster node's disk signatures and volume information.

  2. For clusters with shared storage devices, do the following in addition to Step 1:

    • On the individual cluster nodes, document storage adapter settings, including manufacturer name, model number, and configurations such as SCSI ID and IRQ when applicable. Also, note which motherboard slot each adapter is installed in.

    • On shared storage devices with built-in RAID controllers, record disk array configurations, including array type, array members, hot spares, volume definition, disk IDs, and LUNs.

    • Back up shared cluster disks.

To back up cluster nodes and data on their storage devices, you use the Windows Server 2003 Backup utility (ntbackup.exe). For detailed information about this utility and the different backup options available, refer to Chapters 32 and 33.

Cluster Node Backup Best Practices

As a backup best practice for cluster nodes, administrators should strive to back up everything as frequently as possible. Because cluster availability is so important, here are some recommendations for cluster node backup:

  • Back up each cluster node's system state daily and immediately before and after a cluster configuration change is made.

  • Back up cluster local drives and system state daily if the schedule permits or weekly if daily backups cannot be performed.

  • Back up cluster shared drives daily if the schedule permits or weekly if daily backups cannot be performed.

  • Use the MSCS Recovery utility (ClusterRecovery) provided in the Windows Server 2003 Resource Kit to save configuration information such as checkpoint files. These checkpoint files are stored in the quorum and are used to update Registry settings when resources are moved or failed over to another cluster node.

  • Perform an ASR backup on each node following the creation of a new cluster, monthly, and whenever a change is made on the node. For instance, back up when a new cluster application is installed or when a disk is added or removed from a cluster.
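
The schedule in these recommendations can be expressed as a small check that reports which backups are due for a node. This is a sketch only: the backup names and the `backups_due` function are hypothetical, daily backups are assumed to be feasible, and "monthly" is approximated as 30 days.

```python
from datetime import date, timedelta

# Intervals taken from the recommendations above; "monthly" is approximated
# as 30 days, which is an assumption of this sketch.
INTERVALS = {
    "system_state": timedelta(days=1),
    "local_disks": timedelta(days=1),
    "shared_disks": timedelta(days=1),
    "asr": timedelta(days=30),
}

def backups_due(last_run, today, config_changed=False):
    """Return the backup types that are due, given the date each last ran."""
    due = [kind for kind, interval in INTERVALS.items()
           if last_run.get(kind) is None or today - last_run[kind] >= interval]
    if config_changed:
        # A configuration change calls for a fresh system state and ASR backup.
        for kind in ("system_state", "asr"):
            if kind not in due:
                due.append(kind)
    return due

last = {"system_state": date(2004, 9, 22), "local_disks": date(2004, 9, 21),
        "shared_disks": date(2004, 9, 22), "asr": date(2004, 9, 1)}
print(backups_due(last, date(2004, 9, 22), config_changed=True))
```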

Automated System Recovery Backup

Automated System Recovery has two parts: the ASR backup and the ASR restore. An ASR backup can be used to satisfy one of a cluster node's backup requirements, backing up disk signatures and volume information. When a disk signature is overwritten and the cluster can no longer identify shared disks or read volume information, the administrator needs to restore cluster disk signatures using ASR restore. This approach, however, is a last resort and should be used only if no cluster nodes can communicate with the shared devices and all other cluster restore techniques have been exhausted.

An ASR backup of a cluster node contains a disk signature or signatures and volume information; the current system state, which includes the Registry, cluster quorum, boot files, and the COM+ class registration database; system services; and a backup of all local disks containing operating system files, including system and boot partitions. Currently, the only way to back up disk signatures is to create an ASR backup from the local server console using Windows Server 2003 Backup.

To perform an ASR backup, an administrator needs a blank floppy disk and a backup device; either a tape device or disk will suffice. Using recordable CDs and devices for use with the Backup utility is not yet supported, so if no tape device is available, the backup can be run to a backup file on a local or a network drive. Saving the backup file to a network drive helps to ensure that the media can be accessed when an ASR restore is necessary. One point to keep in mind is that an ASR backup will back up each local drive that contains the operating system and any applications installed. For instance, if the operating system is installed on drive C: and MS Office is installed on drive D:, both of these drives will be completely backed up. Although this can greatly simplify restore procedures, it requires additional storage and increases backup time. Using a basic installation of Windows Server 2003 Enterprise server with only the Cluster Service installed, an ASR backup averages 1.3GB in size.

To create an ASR backup, perform the following steps:

  1. Log on to the cluster node with an account that has the right to back up the system. (Any Local Administrator, Domain Administrator, or Cluster Service account has the necessary permissions to complete the operation.)

  2. Click Start, All Programs, Accessories, System Tools, Backup.

  3. If this is the first time you've run Backup, it will open in Wizard mode. Choose to run it in Advanced mode by clicking the Advanced Mode hyperlink. After you change to Advanced mode, the window should look similar to Figure 31.13.

  4. Click the Automated System Recovery Wizard button to start the Automated System Recovery Preparation Wizard.

  5. Click Next after reading the Automated System Recovery Preparation Wizard Welcome screen.

  6. Choose your backup media type and select the correct tape or file. If you're creating a new file, specify the complete path to the file, and the backup will create the file automatically. Click Next to continue.

  7. If the file you specified resides on a network drive, click OK at the warning message to continue, as shown in Figure 31.14.

    Figure 31.13
    Windows Backup in Advanced mode.

    Figure 31.14
    Warning when selecting a resource for backup.

  8. Click Finish to complete the Automated System Recovery Preparation Wizard and to start the backup.

  9. After the tape or file backup portion completes, the ASR backup prompts you to insert a floppy disk that will contain the recovery information. Insert the disk and click OK to continue.

  10. Remove the floppy disk as requested and label the disk with the appropriate ASR backup information. Click OK to continue.

  11. When the ASR backup is complete, click Close on the Backup Progress window to return to the backup program or click Report to examine the backup report.

ASR backups should be performed periodically and immediately following any hardware changes to a cluster node, including changes on a shared storage device or local disk configuration. The information contained in the ASR floppy disk is also stored on the backup media. The ASR floppy contains two files, asr.sif and asrpnp.sif, that can be restored from the backup media and copied to a floppy disk when an ASR restore is necessary.

Backing Up the Cluster Quorum

The cluster quorum is backed up when the system state of any active cluster node is backed up. This backup can be used to restore a cluster node to operation when cluster database or log corruption occurs or when the cluster needs to be rolled back to a previous state on every cluster node. The cluster quorum should be backed up frequently to ensure that the latest version of the cluster configuration is saved. To back up the cluster quorum, follow the steps outlined in the next section.

Backing Up the Cluster Node System State

Each cluster node's system state should be backed up regularly and before and after any hardware or software changes, including cluster configuration changes. This backup will contain the cluster quorum, local server Registry, COM+ registration database, and boot files necessary to start the system. On a domain controller, the system state will also contain the Active Directory database and the SYSVOL folder.

To back up the system state, perform the following steps:

  1. Log on to the cluster node using an account that has the right to back up the system. (Any Local Administrator, Domain Administrator, or Cluster Service account has the necessary permissions to complete the operation.)

  2. Click Start, All Programs, Accessories, System Tools, Backup.

  3. If this is the first time you've run Backup, it will open in Wizard mode. Choose to run it in Advanced mode by clicking the Advanced Mode hyperlink. After you change to Advanced mode, the window should look like the one in Figure 31.13.

  4. Click the Backup Wizard (Advanced) button to start the Backup Wizard.

  5. Click Next on the Backup Wizard Welcome screen to continue.

  6. On the What to Back Up page, choose the Only Back Up the System State Data button, shown in Figure 31.15, and click Next to continue.

  7. Choose your backup media type and select the correct tape or file. If you're creating a new file, specify the complete path to the file, and the backup will create the file automatically. Click Next to continue.

  8. If the file you specified resides on a network drive, click OK at the warning message to continue.

  9. Click Finish to complete the Backup Wizard and start the backup.

  10. When the backup is complete, review the backup log for detailed information and click Close on the Backup Progress window when finished.

Figure 31.15 Choosing the correct option for backup.
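
For scheduled backups, the wizard steps above have a command-line counterpart in ntbackup.exe. The sketch below only builds the command. The `backup systemstate`, `/J` (job name), and `/F` (backup file) arguments are not covered in this chapter and are given here as an assumption based on the utility's documented syntax, so confirm them with `ntbackup /?`. The job name and file path are hypothetical.

```python
def system_state_backup_command(job_name, backup_file):
    """Build an ntbackup.exe argument list for a system state backup to a file."""
    return ["ntbackup.exe", "backup", "systemstate",
            "/J", job_name,      # name recorded in the backup log
            "/F", backup_file]   # target .bkf file, local or UNC path

cmd = system_state_backup_command(
    "NODE1 system state",
    r"\\backupsrv\backups\node1-systemstate.bkf",
)
print(cmd)
```

The resulting list can be handed to `subprocess.run` or used as the basis for a scheduled task on each cluster node.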

Backing Up the Local Disks on a Cluster Node

The cluster node local disks should be backed up regularly and, if possible, should be backed up with the system state. This allows both the system state and local disks to be recovered if a complete server failure should occur.

To back up a cluster node's local disks, perform the following steps:

  1. Log on to the cluster node with an account that has the right to back up the system. (Any Local Administrator, Domain Administrator, or the Cluster Service account has the necessary permissions to complete the operation.)

  2. Click Start, All Programs, Accessories, System Tools, Backup.

  3. If this is the first time you've run Backup, it will open in Wizard mode. Choose to run it in Advanced mode by clicking the Advanced Mode hyperlink. After you change to Advanced mode, the window should look like the one in Figure 31.13.

  4. Click the Backup Wizard (Advanced) button to start the Backup Wizard.

  5. Click Next on the Backup Wizard Welcome screen to continue.

  6. On the What To Back Up page, choose the Back Up Selected Files, Drives, or Network Data button and click Next to continue.

  7. In the Items To Back Up window, shown in Figure 31.16, expand Desktop\My Computer and choose each of the local drives.

  8. Choose your backup media type and select the correct tape or file. If you're creating a new file, specify the complete path to the file, and the backup will create the file automatically. Click Next to continue.

    Figure 31.16
    Choosing items to back up.

  9. If the file you specified resides on a network drive, click OK at the warning message to continue.

  10. Click Finish to complete the Backup Wizard and start the backup.

  11. When the backup is complete, review the backup log for detailed information and click Close on the Backup Progress window when finished.

Backing Up Shared Disks on a Cluster

Shared storage disks can be backed up in a few different ways. The first way is to back up the disks from the node that is currently hosting them. This way, the disks can be backed up using the same process used to back up local disks, except the shared disks are chosen in the Backup Selection window.

The second way requires knowledge of the disk drive letters or mount points; it can be run and scheduled from any machine on the network using an account with permission to back up the cluster disks. If the drive letters are known, the cluster administrator can create network places that point to the cluster disk's administrative hidden shares. Alternatively, the hidden drive shares can be mapped to a local drive letter and backed up using the appropriate mapped network drives.

For example, in a cluster called CLUSTER1 with nodes named SERVER1 and SERVER2 and two shared disks with drive letters Q and F, the administrator can back up the drives by creating a network place or mapping a drive to \\cluster1\F$ and \\cluster1\Q$. If the disk resources are currently running in groups active on SERVER1, the administrator can also connect to those hidden drive shares using the UNC paths \\SERVER1\F$ and \\SERVER1\Q$. Using the cluster name or the network name of the particular cluster group containing a disk resource is preferred because the path remains valid regardless of which node the group is active on.


Note - If shared disks are defined as volume mount points, backing up the drive also backs up data under the mount points.
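
The UNC paths used in the example above follow a fixed pattern, so a backup script can generate them from the cluster (or group network) name and the drive letters. A minimal sketch with a hypothetical helper name:

```python
def admin_share(host, drive_letter):
    """Return the UNC path of the hidden administrative share for a drive."""
    letter = drive_letter.rstrip(":").upper()
    return "\\\\" + host + "\\" + letter + "$"

# Paths for the two shared disks in the CLUSTER1 example.
for drive in ("F", "Q"):
    print(admin_share("cluster1", drive))
```

Passing the cluster name rather than a node name keeps the generated paths valid after a failover.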


Restoring a Single-Node Cluster When the Cluster Service Fails

When Cluster Service on a single node fails and will not start, it is usually a sign of corruption in the local cluster database file CLUSDB. In the interest of time, an administrator can replace the CLUSDB file with the latest CHKxxx.tmp file from the quorum disk's MSCS directory.

To replace the CLUSDB file, follow these steps:

  1. Log on to the cluster node using an account that has the right to back up the system. (Any Local Administrator, Domain Administrator, or Cluster Service account has the necessary permissions to complete the operation.)

  2. Open Cluster Administrator on an available cluster node. Then check to ensure that all cluster groups are running properly to verify that the Cluster Service problem is only on a single node.

  3. If only one node is experiencing Cluster Service startup problems, log on to the server console and click Start, All Programs, Administrative Tools, Services.

  4. In the Services applet, locate Cluster Service and double-click it.

  5. On the General tab of the property page for Cluster Service, set Startup Type to Disabled. Click OK to save changes.

  6. Reboot the server to release any file locks on the CLUSDB file.

  7. When the server completes the reboot process, log on with a Cluster Administrator account.

  8. Click Start, Run.

  9. Connect to the cluster quorum disk by using the UNC path \\<clustername>\<quorum_drive_letter>$. For example, in a cluster named cluster1 with a quorum disk named Q, use the path \\cluster1\Q$.

  10. Double-click the MSCS directory.

  11. Choose View, Details in the Explorer window.

  12. Locate the file named CHKxxx.tmp with the latest time stamp, similar to the one shown in Figure 31.17.

  13. Right-click the file and choose Copy. Then close the Explorer window.

  14. Click Start, Run.

    Figure 31.17
    Choosing a backup set for restoral.

  15. Type in the full path to the cluster directory and click OK. The default path is C:\windows\cluster, where C is the system drive and windows is the %SystemRoot% directory.

  16. Locate the CLUSDB file, right-click it, and choose Rename.

  17. Rename the file to CLUSDB.old and press Enter to save. If the file cannot be renamed, make sure the Cluster Service startup type is set to Disabled, reboot the server, and then try again.

  18. Choose Edit, Paste in the Explorer window. The CHKxxx.tmp file should now be copied into the C:\windows\cluster directory.

  19. Locate the CHKxxx.tmp file, right-click it, and choose Rename.

  20. Rename the file to CLUSDB and press Enter to save. If the file cannot be renamed, make sure the Cluster Service startup type is set to Disabled, reboot the server, and then try again.

  21. Close the Explorer window.

  22. Click Start, All Programs, Administrative Tools, Services.

  23. In the Services applet, locate Cluster Service and double-click it.

  24. On the General tab of Cluster Service's property page, change Startup Type to Automatic. Click OK to save your changes.

  25. Right-click Cluster Service and choose Start.

  26. When Cluster Service starts, move the appropriate group or groups to the recovered node to test failover functionality.
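
The file handling in steps 9 through 20 (find the newest CHKxxx.tmp file, keep the old database as CLUSDB.old, and put the checkpoint in its place) can be sketched in a few lines. This is an illustration under stated assumptions, not a supported recovery tool: the function names are hypothetical, and Cluster Service must already be disabled and the node rebooted so that CLUSDB is not locked.

```python
import shutil
from pathlib import Path

def latest_checkpoint(mscs_dir):
    """Return the CHKxxx.tmp file with the most recent modification time."""
    candidates = [p for p in Path(mscs_dir).iterdir()
                  if p.is_file()
                  and p.name.lower().startswith("chk")
                  and p.suffix.lower() == ".tmp"]
    if not candidates:
        raise FileNotFoundError(f"no CHKxxx.tmp files found in {mscs_dir}")
    return max(candidates, key=lambda p: p.stat().st_mtime)

def replace_clusdb(mscs_dir, cluster_dir):
    """Keep the old CLUSDB as CLUSDB.old and copy the newest checkpoint over it."""
    cluster_dir = Path(cluster_dir)
    clusdb = cluster_dir / "CLUSDB"
    newest = latest_checkpoint(mscs_dir)
    if clusdb.exists():
        # Overwrites any CLUSDB.old left from an earlier attempt.
        clusdb.replace(cluster_dir / "CLUSDB.old")
    shutil.copy2(newest, clusdb)
    return newest
```

For the cluster in the example, this would be called as `replace_clusdb(r"\\cluster1\Q$\MSCS", r"C:\windows\cluster")`, after which Cluster Service is set back to Automatic and started as in steps 22 through 25.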

If this process does not restore operational status to Cluster Service, restore the system state from a previous backup by following these steps:

  1. Click Start, All Programs, Accessories, System Tools, Backup.

  2. If this is the first time you've run Backup, it will open in Wizard mode. Choose to run it in Advanced mode by clicking on the Advanced Mode hyperlink. After you change to Advanced mode, the window should look like the one in Figure 31.13.

  3. Click the Restore Wizard (Advanced) button to start the Restore Wizard.

  4. Click Next on the Restore Wizard Welcome screen to continue.

  5. On the What to Restore page, select the appropriate cataloged backup media, expand the catalog selection, and check System State, as shown in Figure 31.18. Click Next to continue.

    Figure 31.18 
    Choosing to restore the system state.

  6. If the correct tape or file backup media does not appear in this window, cancel the restore process. Then, from the Restore Wizard page, locate and catalog the appropriate media and return to the restore process from step 1.


    Note - Refer to Chapter 33 for information on how to catalog tape and file backup media.


  7. On the Completing the Restore Wizard page, click Finish to start the restore.

  8. When the process is complete, review the log for detailed information and click Close when finished.

  9. Reboot the restored cluster node as prompted.

  10. When Cluster Service starts, move the appropriate group or groups to the recovered node to test failover functionality.

Restoring a Single Node After a Complete Server Failure

When a single node fails, whether because of hardware problems or software corruption that cannot be repaired in a reasonable amount of time, the node must be rebuilt from scratch. After any hardware problems are resolved, the organization can decide what the best approach to server recovery will be. The two basic approaches to node recovery are outlined next.

Evicting and Rebuilding the Failed Node

This first node recovery process evicts the failed node from the cluster and requires the cluster administrator to rebuild the cluster node from scratch, rejoin the node to the cluster, install any cluster applications, and finally reconfigure the cluster's group failover and failback configurations.

To evict and rebuild the failed node, follow these steps:

  1. Shut down the failed cluster node.

  2. On an available cluster node, log in using a Cluster Administrator account.

  3. Click Start, Administrative Tools, Cluster Administrator.

  4. If Cluster Administrator does not connect to the cluster or connects to a different cluster, choose File, Open Connection.

  5. From the Action drop-down box, choose Open Connection to Cluster. Then, in the Cluster or Server Name drop-down box, type . (period) and click OK to connect.

  6. In the left pane of the Cluster Administrator window, right-click the offline cluster node and choose Evict Node.

  7. When the node is evicted, close Cluster Administrator and immediately start a backup of the local node's system state. Refer to the previous section "Backing Up the Cluster Node System State" for detailed steps for system state backup.

  8. On the failed node, install a clean copy of Windows Server 2003 Enterprise or Datacenter server.

  9. After it is loaded, configure the server to join the correct domain and configure all local drive letters and network card IP addresses as previously configured on the original cluster node. Then reboot if necessary.

  10. Follow the steps to rejoin the cluster as outlined in the previous section, "Adding Additional Nodes to a Cluster."

  11. After the node rejoins the cluster, install any cluster applications as outlined in the vendor's installation guide for cluster installation.

  12. Configure cluster group failover and failback as necessary and move cluster groups to their preferred node.

Restoring the Failed Node Using the ASR Restore

To restore the failed node using the ASR restore, follow these steps:

  1. Shut down the failed cluster node.

  2. On an available cluster node, log in using a Cluster Administrator account.

  3. Click Start, Administrative Tools, Cluster Administrator.

  4. If Cluster Administrator does not connect to the cluster or connects to a different cluster, choose File, Open Connection.

  5. From the Action drop-down box, choose Open Connection to Cluster. Then, in the Cluster or Server Name drop-down box, type . (period) and click OK to connect.

  6. Within each cluster group, make sure to disable failback to prevent these groups from failing over to a cluster node that is not completely restored. Close Cluster Administrator.

  7. Locate the ASR floppy created for the failed node or create the floppy from the files saved in the ASR backup media. For information on creating the ASR floppy from the ASR backup media, refer to Help and Support on any Windows Server 2003 system.

  8. Insert the operating system CD in the failed server and start the server.

  9. If necessary, when prompted, press F6 to install any third-party storage device drivers. This includes any third-party disk or tape controllers that Windows Server 2003 will not recognize.

  10. Press F2 when prompted to perform an automated system recovery.

  11. When prompted, insert the ASR floppy disk and press Enter.

  12. The operating system installation will proceed by restoring disk volume information and reformatting the volumes associated with the operating system. When this process is complete, restart the server as requested by pressing F3 and then Enter in the next window.

  13. After the system restarts, press a key if necessary to restart the CD installation.

  14. If necessary, when prompted, press F6 to install any third-party storage device drivers. This includes any third-party disk or tape controllers that Windows Server 2003 will not recognize.

  15. Press F2 when prompted to perform an automated system recovery.

  16. When prompted, insert the ASR floppy disk and press Enter.

  17. This time, the disks can be properly identified and will be formatted, and the system files will be copied to the respective disk volumes. When this process is complete, the ASR restore will automatically reboot the server. Remove the ASR floppy disk from the drive. The graphic-based OS installation will begin.

  18. If necessary, specify the network location of the backup media using a UNC path and enter authentication information if prompted. The ASR restore attempts to reconnect to the backup media automatically but cannot do so if the backup media are on a network drive.

  19. When the media are located, open the media and click Next. Then finish recovering the remaining ASR data.

  20. When the ASR restore is complete, if any local disk data was not restored with the ASR restore, restore all local disks.

  21. Click Start, All Programs, Accessories, System Tools, Backup.

  22. If this is the first time you've run Backup, it will open in Wizard mode. Choose to run it in Advanced mode by clicking the Advanced Mode hyperlink. After you change to Advanced mode, the window should look like the one in Figure 31.13.

  23. Click the Restore Wizard (Advanced) button to start the Restore Wizard.

  24. Click Next on the Restore Wizard Welcome screen to continue.

  25. On the What To Restore page, select the appropriate cataloged backup media, expand the catalog selection, and check each local drive. Click Next to continue.

  26. If the correct tape or file backup media do not appear in this window, cancel the restore process. Then locate and catalog the appropriate media from the Restore Wizard page and return to the restore process from step 23.


    Note - Refer to Chapter 33 for information on how to catalog tape and file backup media.


  27. On the Completing the Restore Wizard page, click Finish to start the restore. Because you want to restore only what ASR did not, you do not need to make any advanced restore configuration changes.

  28. When the restore is complete, reboot the server as prompted.

  29. After the reboot is complete, log on to the restored cluster node and check cluster node functionality.

  30. If everything is working properly, open Cluster Administrator and configure all cluster group failover and failback configurations.

  31. Move cluster groups to their preferred node and close Cluster Administrator.

Restoring an Entire Cluster to a Previous State

Changes to a cluster should be made with caution and, if at all possible, should be made in a lab environment first. When cluster changes have been implemented and deliver undesirable effects, the way to roll back the cluster configuration to a previous state is to restore the cluster quorum to all nodes. This process is simpler than it sounds and is performed from only one node. There are only two disadvantages to this process:

  • All the cluster nodes that were members of the cluster previously need to be currently available and operational in the cluster. For example, if Cluster1 was made up of Server1 and Server2, both of these nodes need to be active in the cluster before the previous cluster configuration can be rolled back.

  • To restore a previous cluster configuration to all cluster nodes, the entire cluster needs to be taken offline long enough to restore the backup, reboot the node from which the backup was run, and manually start Cluster Service on all remaining nodes.


Note - If a cluster node is in a failed state, the cluster configuration cannot be rolled back. Refer to the "Restoring a Single Node After a Complete Server Failure" or the "Restoring the Failed Node Using the ASR Restore" sections to restore a failed cluster node to operational status and then restore a previous cluster configuration as shown here.


To restore an entire cluster to a previous state, perform the following steps:

  1. Log on to the cluster node using an account that has the right to back up the system. (Any Local Administrator, Domain Administrator, or Cluster Service account has the necessary permissions to complete the operation.)

  2. Click Start, All Programs, Accessories, System Tools, Backup.

  3. If this is the first time you've run Backup, it will open in Wizard mode. Choose to run it in Advanced mode by clicking the Advanced Mode hyperlink. After you change to Advanced mode, the window should look like the one in Figure 31.13.

  4. Click the Restore Wizard (Advanced) button to start the Restore Wizard.

  5. Click Next on the Restore Wizard Welcome screen to continue.

  6. On the What To Restore page, select the appropriate cataloged backup media, expand the catalog selection, and check System State (refer to Figure 31.18). Click Next to continue.

  7. If the correct tape or file backup media does not appear in this window, cancel the restore process. Then, from the Restore Wizard page, locate and catalog the appropriate media and return to the restore process from step 4.

  8. On the Completing the Restore Wizard page, select the Advanced button to configure advanced restore settings.

  9. On the Where To Restore page, choose to restore files to the original location and click Next.

  10. A warning message will pop up stating that restoring the system state will overwrite the current system state. Click OK to continue.

  11. On the How To Restore page, choose the Leave Existing Files (Recommended) radio button and click Next to continue.

  12. On the Advanced Restore Options page, check the Restore the Cluster Registry to the Quorum Disk and All Other Nodes box, similar to the options selected in Figure 31.19, and click Next to continue.

  13. A warning message pops up stating that this restore will replace the master version of the cluster quorum and will stop Cluster Service on all the other nodes in the cluster. Click Yes to continue.

  14. On the Completing The Restore Wizard page, click Finish to start the restore.

    Figure 31.19 
    Selecting options for restoral.

  15. When the process is complete, review the log for detailed information and click Close when finished.

  16. Reboot the restored cluster node as prompted.

  17. After the restored node completes rebooting and the previous cluster configuration is restored, start Cluster Service on all the remaining cluster nodes.

  18. Move cluster groups as desired and close Cluster Administrator.

This chapter is from Microsoft Windows Server 2003 Unleashed, by Rand Morimoto, et al. (Sams Publishing, 2004, ISBN: 0672326671). Check it out at your favorite bookstore today.


Restoring Cluster Nodes After a Cluster Failure

Cluster nodes can be restored after a cluster failure using a combination of the previously described restore steps, with a few added steps. If each cluster node can start but Cluster Service cannot start on any node, there is most likely a problem with the quorum drive or quorum data.

To restore the cluster nodes in this situation, follow these steps:

  1. To restore the quorum data, follow the steps outlined in the section titled "Restoring a Single-Node Cluster When the Cluster Service Fails."

  2. After the system state restore is completed, if Cluster Service starts on the first node, start Cluster Service on all the remaining nodes.

    If Cluster Service does not start, there may be a problem with the cluster quorum drive. Make any necessary repairs on the cluster quorum drive and restore the cluster quorum as outlined in the section "Restoring a Single-Node Cluster When the Cluster Service Fails."

    If Cluster Service still does not start, follow the instructions in the Windows Server 2003 Help and Support article named "Recover from a Corrupted Quorum Log or Quorum Disk."

When all nodes in the cluster are non-operational and the cluster nodes need to be rebuilt from scratch, follow these steps:

  1. Power off all nodes in the cluster.

  2. Power on only the first cluster node and perform an ASR restore as outlined in the section "Restoring the Failed Node Using the ASR Restore." This restore should bring back the node, Cluster Service, and basic cluster functionality.

  3. Restore any missing local disk data and cluster disk data.

  4. Perform ASR and local disk restores on remaining cluster nodes to restore complete cluster functionality.

Upgrading Cluster Nodes

Windows Server 2003 Cluster server is compatible with previous versions of Microsoft Cluster Service and can accommodate node operating system upgrades. Windows NT 4.0 clusters must be taken offline before upgrading to Windows Server 2003 clusters, whereas Windows 2000 clusters can be upgraded to Windows Server 2003 while online, utilizing the rolling upgrade method. Before a rolling upgrade can be performed, each resource in the cluster must be checked to see whether it can be upgraded during a rolling upgrade.


Note - Resources that do not allow rolling upgrades include IIS, FTP, DHCP, WINS, SMTP, and NNTP services, to name a few. For detailed instructions on how to upgrade clusters containing these resources, refer to the Help and Support in the Windows Server 2003 operating system and search for "resource behavior during rolling upgrades" and "last node rolling upgrades."


Rolling Upgrades

A rolling upgrade allows a single cluster node to be taken offline for an operating system upgrade while the other nodes in the cluster function on the original OS version. On a standalone server, this is referred to as an in-place upgrade. When the upgraded node is back online with the new operating system, the Cluster server is already installed and configured. Cluster groups running on the other nodes can then be moved to the upgraded node, thus enabling administrators to upgrade the remaining nodes in the cluster.

Before attempting a rolling upgrade, the cluster administrator must research all the applications and resources in the cluster to ensure they can be supported during a rolling upgrade. If such an upgrade is not an option, the cluster nodes can be upgraded using the last node rolling upgrade method.

Last Node Rolling Upgrade

The last node rolling upgrade is a process created to upgrade clusters that contain resources that are unsupported during standard rolling upgrades. In this type of upgrade, the administrator moves all the groups containing resources that are unsupported in a standard rolling upgrade to a single cluster node. Then she upgrades all other nodes in the cluster. After all the other nodes are upgraded, she moves the groups with the unsupported resources to the upgraded nodes. Then the administrator performs an operating system upgrade on the last node and redistributes all the cluster groups as necessary.
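
The ordering described above can be written down as a small planning helper. This is only a sketch of the sequence of steps, not a tool that performs an upgrade; the function name, the node names, and the wording of each step are hypothetical and chosen for illustration.

```python
def last_node_upgrade_plan(nodes, last_node):
    """Return the ordered steps of a last node rolling upgrade.

    nodes      -- names of all cluster nodes (hypothetical names)
    last_node  -- the node that temporarily hosts every group containing
                  resources that a standard rolling upgrade does not support
    """
    if last_node not in nodes:
        raise ValueError("last_node must be one of the cluster nodes")
    others = [n for n in nodes if n != last_node]
    if not others:
        raise ValueError("a rolling upgrade needs at least two nodes")

    plan = [f"move groups with unsupported resources to {last_node}"]
    # Every other node is upgraded while last_node keeps the groups online.
    plan += [f"upgrade {n}" for n in others]
    plan.append(f"move groups with unsupported resources off {last_node}")
    plan.append(f"upgrade {last_node}")
    plan.append("redistribute cluster groups")
    return plan


if __name__ == "__main__":
    for number, step in enumerate(last_node_upgrade_plan(["node1", "node2", "node3"], "node3"), 1):
        print(number, step)
```

Writing the plan out this way makes it easy to confirm that the node holding the unsupported resources is always the last one to be upgraded.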


Installing Network Load Balancing Clusters

An NLB cluster can be created easily using the Network Load Balancing Manager utility provided with the Windows Server 2003 Administrative Tools. NLB clusters can also be created using the network interface card property pages or at a command prompt using NLB.exe. To properly configure an NLB cluster, the administrator needs to research the type of network traffic the load-balanced application or service will utilize. For example, to load-balance standard Web traffic, the cluster needs to support TCP port 80, and for Terminal Services, the cluster needs to support TCP port 3389.

NLB Applications and Services

Network load balancing is well equipped to distribute user connections and create fault tolerance for a number of different applications and network services. Because NLB does not replicate data across cluster nodes, it is a poor choice for applications that require access to local data that end users can change. For example, file servers that store user data directories and database servers are poor candidates: a user may save a file or change data in a database while connected to one node and later reconnect to a different node, only to find the file missing or the database changes absent.

Applications well suited for NLB clusters are Web sites serving static content or dynamic content built from a back-end database running outside the NLB cluster. Also, Windows Server 2003 Terminal servers, VPN servers, Internet Security and Acceleration servers, and streaming media servers are well suited to be deployed on NLB clusters.

Because the most important part of an NLB deployment is determining what cluster operation mode and port rules need to be used for the load-balanced application to function correctly, the cluster administrator must understand the application thoroughly. It's important to read the vendor's application documentation regarding how the client communicates with the application. For instance, certain applications use cookies or other stateful session information that can be used to identify a client throughout the entire session. As a result, applications configured to prompt users for authentication upon starting a session will fail if the user's future requests are sent to a different cluster node that has not authenticated the user. Knowing these considerations in advance will help determine the required settings that need to be configured using cluster port rules and the filtering mode.

Creating Port Rules

When an NLB cluster is created, one general port rule is also created for this cluster. The port rule or rules define what type of network traffic the cluster will load-balance across the cluster nodes. The Port Rules Filtering option defines how the traffic will be balanced across each individual node. As a best practice, limiting the allowed ports for the clustered IP addresses to only those needed by the cluster load-balanced applications can improve overall cluster performance and security. In an NLB cluster, because each node can answer for the clustered IP address, all inbound traffic is received at each node. When a node receives the request, it either handles the request or drops the packet if another node already has a session with the source client. If a port rule does not define how traffic will be handled for a particular TCP or UDP port, traffic on those ports will be handled by the cluster node with the lowest host priority.

When an administrator creates port rules that allow only specific ports to the clustered IP address and an additional rule blocking all other ports and ranges, the cluster nodes can quickly eliminate and drop packets that do not meet the port rules, thereby improving performance by blindly dropping any packets not allowed by the cluster. The security benefit is that because only a specific port or service is available on the clustered IP address, monitoring that server and maintaining security updates are simpler.
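
To make the "allow specific ports, block everything else" idea concrete, the following sketch takes a list of allowed port rules and fills every gap in the 0 to 65535 range with a disabled rule. It is a planning aid written for this article, not part of NLB; the `PortRule` class, the mode labels, and the `fill_with_disabled` function are all hypothetical names.

```python
from dataclasses import dataclass

MAX_PORT = 65535  # highest TCP/UDP port number


@dataclass(frozen=True)
class PortRule:
    start: int  # first port in the range (inclusive)
    end: int    # last port in the range (inclusive)
    mode: str   # "multiple", "single", or "disabled" (labels used only in this sketch)


def fill_with_disabled(allowed):
    """Return the allowed rules plus 'disabled' rules covering every other port."""
    rules = []
    next_port = 0
    for rule in sorted(allowed, key=lambda r: r.start):
        if rule.start < next_port:
            raise ValueError(f"rule {rule} overlaps an earlier rule")
        if rule.start > next_port:
            # Gap before this rule: block it explicitly.
            rules.append(PortRule(next_port, rule.start - 1, "disabled"))
        rules.append(rule)
        next_port = rule.end + 1
    if next_port <= MAX_PORT:
        rules.append(PortRule(next_port, MAX_PORT, "disabled"))
    return rules


if __name__ == "__main__":
    # Terminal Services only: TCP 3389 is balanced, everything else is blocked.
    for r in fill_with_disabled([PortRule(3389, 3389, "multiple")]):
        print(r.start, r.end, r.mode)
```

For a Terminal server cluster that allows only port 3389, the result is three rules: a disabled range below 3389, the allowed rule itself, and a disabled range above it.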

Port Rules Filtering Mode and Affinity

Within a cluster port rule, the NLB administrator must configure the appropriate filtering mode. This allows the administrator to specify whether only one node or multiple nodes in the cluster can respond to requests from a single client throughout a session. There are three filtering modes: Single Host, Disable Port Range, and Multiple Host.

The Single Host Mode

The Single Host filtering mode provides network traffic meeting the port rule criteria to only one node in the cluster. An example is an IIS Web farm in which only one server has a Secure Sockets Layer (SSL) certificate for a secure Web site. In this case, creating a rule to allow port TCP 443 (SSL port) using single host filtering isolates this traffic to the node with the certificate installed.

The Disable Port Range Mode

The Disable Port Range filtering mode tells the cluster which ports not to listen on and to drop these packets without investigation. Administrators should configure port rules and use this filter mode for ports and port ranges that do not need to be load-balanced across the cluster nodes.

The Multiple Host Mode

The Multiple Host filtering mode is probably the most commonly used filtering mode and is also the default. This mode allows traffic to be handled by all the nodes in the cluster. When traffic is balanced across multiple nodes, the application requirements define how the affinity mode should be set.

There are three types of multiple host affinities:

  • None—This affinity type can send a unique client's requests to all the servers in the cluster during the session. This can speed up server response times but is well suited only for serving static data to clients. This affinity type works well for general Web browsing and read-only file and FTP servers.

  • Class C—This affinity type routes traffic from a particular class C address space to a single NLB cluster node. This mode is not used too often but can accommodate client sessions that do require stateful data. This affinity does not work well if all the client requests are proxied through a single firewall.

  • Single—This affinity type is the most widely used. After the initial request is received by the cluster nodes from a particular client, that node will handle every request from that client until the session is completed. This affinity type can accommodate sessions that require stateful data.
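
The difference between the three affinity types comes down to which part of the client's address decides the serving node. The sketch below illustrates that idea only. It assumes a SHA-256 hash as a stand-in for whatever distribution algorithm NLB really uses (the source does not describe it), and the function names, affinity labels, and node names are hypothetical.

```python
import hashlib
import ipaddress


def affinity_key(client_ip, client_port, affinity):
    """Return the value that decides which node serves a connection."""
    if affinity == "none":
        # Every connection (IP and source port) is balanced independently.
        return f"{client_ip}:{client_port}"
    if affinity == "single":
        # All connections from one client IP go to the same node.
        return client_ip
    if affinity == "class_c":
        # All clients in the same /24 (class C) range go to the same node.
        network = ipaddress.ip_network(f"{client_ip}/24", strict=False)
        return str(network.network_address)
    raise ValueError(f"unknown affinity: {affinity}")


def pick_node(client_ip, client_port, affinity, nodes):
    """Map the affinity key onto one node (illustrative hash, not NLB's own)."""
    key = affinity_key(client_ip, client_port, affinity)
    digest = hashlib.sha256(key.encode("ascii")).digest()
    return nodes[int.from_bytes(digest[:4], "big") % len(nodes)]


if __name__ == "__main__":
    nodes = ["node1", "node2", "node3"]
    for port in (40001, 40002, 40003):
        print(port,
              pick_node("192.168.10.57", port, "none", nodes),
              pick_node("192.168.10.57", port, "single", nodes))
```

With Single affinity the chosen node does not change when the client opens a connection from a new source port, which is why that setting can carry stateful sessions. It also shows why Class C affinity behaves poorly behind a single proxy: every request arrives from one address range and therefore lands on one node.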

Avoiding Switch Port Flooding

Because each node in an NLB cluster answers for incoming traffic, the cluster nodes do not allow a switch to cache their network card MAC address because the cluster nodes want to determine how to route the incoming packets. Because the network switch cannot cache the MAC address associated with the cluster IP addresses, it sends each incoming packet out on every port of the switch, so every connected device has to receive and examine it. When there is heavy traffic going to the cluster, a network switch can become flooded with requests, decreasing performance.

To reduce the risk of switch flooding, the NLB nodes should be connected to an isolated switch or should be configured in a single VLAN if the switch and network support VLANs. For detailed information regarding VLAN configuration and avoiding switch flooding, refer to the network switch documentation.
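
The flooding behavior can be illustrated with a toy model of a learning switch. This is a simplified sketch, not a description of any particular switch: the `LearningSwitch` class is hypothetical, and it assumes that cluster nodes send frames with a source address different from the cluster MAC address, so the switch never associates the cluster address with a port.

```python
class LearningSwitch:
    """Toy switch: learns source addresses, floods unknown destinations."""

    def __init__(self, port_count):
        self.ports = list(range(port_count))
        self.mac_table = {}  # MAC address -> port it was last seen on

    def receive(self, in_port, src_mac, dst_mac):
        """Learn the sender's port, then return the ports the frame goes out on."""
        self.mac_table[src_mac] = in_port
        if dst_mac in self.mac_table:
            return [self.mac_table[dst_mac]]
        # Unknown destination: flood to every port except the incoming one.
        return [p for p in self.ports if p != in_port]


if __name__ == "__main__":
    switch = LearningSwitch(4)
    # The client on port 0 sends to the cluster address: flooded.
    print(switch.receive(0, "client", "cluster-mac"))
    # A node on port 1 replies, but not from the cluster address (assumption).
    print(switch.receive(1, "node1-masked", "client"))
    # The next client frame to the cluster address is flooded again.
    print(switch.receive(0, "client", "cluster-mac"))
```

Because the cluster address never appears as a source, every frame sent to it reaches all ports, which is why isolating the nodes on their own switch or VLAN limits the impact.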

Using Cluster Operation Mode

There are two cluster operation modes: Unicast and Multicast. Most network traffic is handled through Unicast mode. Clients and servers maintain a one-to-one network connection. Multicast networking allows a server to send out information to one multicast address that is then processed by a number of clients. To receive multicast data, a client joins a multicast group associated with the multicast address. Common applications that use multicast are streaming video Web sites, Internet radio, and Internet training or college courses.

Configuring Network Cards for NLB

Configuring the network cards on the NLB cluster nodes is the first step in building the cluster. Although these steps can be performed during cluster creation using the NLB Manager, the same result can be achieved by editing the TCP/IP properties of each of the cluster node's network cards.

Many cluster installations use Unicast operation mode, which imposes some limitations and network overhead on the cluster nodes. When a single network card is used in Unicast mode, the NLB Manager does not run from the local console, so the administrator must configure and manage the cluster from a non-cluster node, use the network card's TCP/IP and network load balancing property pages, or use the command-line tool NLB.exe. Also, in this configuration the MAC address used for the network adapter's dedicated IP address is replaced with the cluster MAC address, causing additional network traffic for all nodes in the cluster when communication is requested for the dedicated IP address.

Best practice for NLB cluster nodes running in Unicast mode is to have two network cards to allow host communication to occur on one NIC while cluster communication is isolated on the cluster NIC. Multiple NICs can also add greater flexibility when it comes to controlling traffic and managing network security.


Using the Network Load Balancing Manager to Create a Cluster

Using the Network Load Balancing Manager is the simplest method of creating a cluster. If the NLB Manager is used, all additional cluster and dedicated IP addresses will be added to the respective cluster node when it joins the cluster. Adding additional nodes to the cluster is also simplified; the administrator needs to know only the cluster name or IP address to add the node to the cluster. Network Load Balancing Manager works well for configuring clusters on remote servers, but if the cluster is local, NLB Manager functions correctly only if the server has multiple network cards.

To create a cluster, follow these steps:

  1. Log on to the local console of a cluster node using an account with Local Administrator privileges.

  2. Click Start, All Programs, Administrative Tools, Network Load Balancing Manager.

  3. Choose Cluster, New.

  4. Enter the cluster IP address and subnet mask of the new cluster.

  5. Enter the fully qualified domain name for the cluster in the Full Internet Name text box.

  6. Enter the mode of operation (Unicast will meet most of your NLB application deployments).

  7. Configure a remote control password if you will be using the command-line utility NLB.exe to remotely manage the NLB cluster and click Next to continue.

  8. Enter any additional IP addresses that will be load-balanced and click Next to continue.

  9. Configure the appropriate port rules for each IP address in the cluster, being careful to set the correct affinity for the load-balanced applications.

  10. After creating all the allowed port rules, you should create disabled port rules to reduce network overhead for the cluster nodes. Be sure to have a port rule for every possible port and click Next on the Port Rules page after all port rules have been created. Figure 31.20 shows a best practice port rule for an NLB Terminal server implementation.

    Figure 31.20 
    Port rule settings for NLB configuration.

  11. On the Connect page, type the name of the server you want to add to the cluster in the Host text box and click Connect.

  12. In the Interface Available window, select the NIC that will host the cluster IP address and click Next to continue.

  13. On the Host Parameters page, set the cluster node priority. Each node requires a unique host priority, and because this is the first node in the cluster, leave the default of 1.

  14. If the node will perform non-cluster-related network tasks on the same NIC, enter the dedicated IP address and subnet mask. The default is the IP address already bound to the network card.

  15. For nodes that should join the cluster immediately after cluster creation and after each startup, leave the initial host state set to Started. When maintenance is necessary, you can change the default state of a particular cluster node to Stopped or Suspended to keep the server from joining the cluster following a reboot.

  16. After you enter all the information on the Host Parameters page, click Finish to create the cluster.

  17. When you're ready to release to the production environment, add the HOST or A record of the new cluster to the DNS domain table. Contact your DNS administrator for information on how to complete this task.

Adding Additional Nodes to an Existing NLB Cluster

When a cluster already exists, administrators can add nodes to it from any server or workstation that has network connectivity to the cluster and the Network Load Balancing Manager installed, provided the account used has administrative permissions on the cluster.

To add nodes to an existing cluster, perform the following steps:

  1. Log on to a workstation or server that has the Windows Server 2003 Administrative Tools installed.

  2. Click Start, All Programs, Administrative Tools and right-click Network Load Balancing Manager.

  3. Choose the Run-as option and specify an account that has Administrative permissions on the cluster.

  4. Choose Cluster, Connect to Existing.

  5. In the Host text box, type the IP address or name of the cluster and click Connect.

  6. From the Clusters window, select the cluster you want to connect to and click Finish to connect.

  7. In the right pane, right-click the cluster name and choose Add Host to Cluster, as shown in Figure 31.21.

    Figure 31.21 
    Choosing to add a host to the cluster.

  8. On the Connect page, type the name of the server you want to add to the cluster in the Host text box and click Connect.

  9. In the Interface Available window, select the NIC that will host the cluster IP address and click Next to continue.

  10. On the Host Parameters page, set the cluster node priority. Each node requires a unique host priority, so choose a value that is not already assigned to another node in the cluster.

  11. If the node will perform non-cluster-related network tasks on the same NIC, enter the dedicated IP address and subnet mask. The default is the IP address already bound to the network card.

  12. For nodes that should join the cluster immediately after being added and after each startup, leave the initial host state set to Started. When maintenance is necessary, you can change the default state of a particular cluster node to Stopped or Suspended to keep the server from joining the cluster following a reboot.

  13. After you enter all the information in the Host Parameters page, click Finish to add the node to the cluster.
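
Because each node requires a unique host priority, it can help to track the assigned values before adding a host. The sketch below picks the next unused priority and reports which node would handle traffic that no port rule covers. It assumes priorities run from 1 to 32 (NLB's limit of 32 hosts per cluster), and the function and node names are hypothetical.

```python
MAX_HOSTS = 32  # assumed range of host priorities: 1 through 32


def next_free_priority(used):
    """Return the lowest host priority not yet assigned to a node."""
    for priority in range(1, MAX_HOSTS + 1):
        if priority not in used:
            return priority
    raise ValueError("all host priorities are in use")


def default_host(priorities):
    """Return the node with the lowest priority number.

    priorities maps node name -> host priority. Traffic not covered by
    any port rule is handled by this node.
    """
    values = list(priorities.values())
    if len(set(values)) != len(values):
        raise ValueError("host priorities must be unique")
    return min(priorities, key=priorities.get)


if __name__ == "__main__":
    cluster = {"web1": 1, "web2": 2}
    cluster["web3"] = next_free_priority(set(cluster.values()))
    print(cluster["web3"], default_host(cluster))
```

A check like this catches a duplicate priority before the new host is added rather than after the cluster reports a conflict.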


Managing NLB Clusters

A cluster can be managed using the NLB Manager or the NLB.exe command-line utility. Using the NLB Manager, a node can be added, removed, or suspended from cluster operation to perform maintenance, including hardware or software updates. Because data is not replicated between cluster nodes, any data needs to be replicated manually or by using a tool such as Robocopy.exe, which is located in the Windows Server 2003 Resource Kit.


Note - Network activity for NLB clusters can be monitored using the Network Monitor and parsers provided in the Windows Server 2003 Resource Kit. These parsers are called wlbs_hb.dll and wlbs_rc.dll.


Backing Up and Restoring NLB Nodes

The procedure for backing up and restoring NLB nodes is no different than for standalone servers. An ASR backup should be created after any major server configuration change, and the local disks and system state of each node should be backed up regularly (weekly). An NLB configuration can be restored when the system state of a particular node is restored. If a full node recovery is necessary, the system state and local disks should be restored or an ASR restore should be performed. For detailed backup and restore procedures, refer to Chapters 32 and 33 and follow procedures for backing up and restoring standalone servers.

Performing Maintenance on a Cluster Node

To perform maintenance on an NLB cluster node, the administrator can temporarily remove the node from the cluster, perform the upgrade, and add it back in later. Removing the node from the cluster without affecting user connections requires the use of the drainstop option from the Network Load Balancing Manager. The drainstop option tells the cluster to take this node offline and immediately stop connecting new clients to this node. Existing sessions will remain active until they are all closed. When all the sessions are complete, maintenance can be performed, and the server can be made available in the cluster to start accepting user requests.
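
The drainstop behavior described above can be modeled as a small state machine. This is a toy sketch of the concept, not NLB code: the `Node` class, its method names, and the state labels are hypothetical.

```python
class Node:
    """Toy model of how one NLB node treats connections during a drainstop."""

    def __init__(self, name):
        self.name = name
        self.state = "started"  # started -> draining -> stopped
        self.sessions = set()

    def connect(self, client):
        """Accept a new client only while the node is fully started."""
        if self.state != "started":
            return False
        self.sessions.add(client)
        return True

    def drainstop(self):
        """Refuse new clients; stop right away if no sessions remain."""
        self.state = "draining" if self.sessions else "stopped"

    def disconnect(self, client):
        """Close a session; a draining node stops when the last one ends."""
        self.sessions.discard(client)
        if self.state == "draining" and not self.sessions:
            self.state = "stopped"

    def start(self):
        """Return the node to service after maintenance."""
        self.state = "started"


if __name__ == "__main__":
    node = Node("ts1")
    node.connect("alice")
    node.drainstop()
    print(node.state, node.connect("bob"))  # existing session kept, new one refused
    node.disconnect("alice")
    print(node.state)                       # safe to begin maintenance
```

The key point is that the node reaches the stopped state only after its last existing session closes, so no user is disconnected by the maintenance itself.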

To perform maintenance on a cluster node, follow these steps:

  1. Log on to a workstation or server that has the Windows Server 2003 Administrative Tools installed.

  2. Click Start, All Programs, Administrative Tools and right-click Network Load Balancing Manager.

  3. Choose the Run-as option and specify an account that has Administrative permissions on the cluster.

  4. Choose Cluster, Connect to Existing.

  5. In the Host text box, type the IP address or name of the cluster and click Connect.

  6. From the Clusters window, select the cluster you want to connect to and click Finish to connect.

  7. Each node in the cluster should appear with a green background, signifying operational status. Right-click the node to perform maintenance and select Control Host, Drainstop, as shown in Figure 31.22.

    Figure 31.22 
    Selecting the Control Host, Drainstop option.

  8. When the node is draining, it should have a half-red and half-green background, and the drainstop operation result should be listed in the log window. Right-click the draining cluster node and select Host Status.

  9. Refer to the summary status to verify that the node is draining and then click OK to close this window.

  10. After all connections on this node are complete, the node will turn red. Perform any necessary maintenance.

  11. When maintenance is complete, if no reboots are necessary, in the NLB Manager, right-click the node and choose Start. If a reboot is necessary, the node will rejoin the cluster according to Initial Host State settings on the Host property page. Change the Initial Host State settings as necessary to achieve the desired node effect according to the type of maintenance that is being performed.

  12. When the node completes rejoining the cluster, it should have a green background in the NLB Manager window.

  13. Click File, Close to exit the Network Load Balancing Manager utility.

Removing a Node from an NLB Cluster

To remove an existing node from a cluster, follow the steps up to step 10 in the "Performing Maintenance on a Cluster Node" section. Then do the following:

  1. Right-click the node and choose Delete Host.

  2. A warning message pops up stating that this action will remove the node from the cluster. Click the Yes button to remove the node.

Deleting the Entire Cluster

To delete an entire cluster, follow the procedure in the "Performing Maintenance on a Cluster Node" section on each node in the cluster. When all nodes are red, indicating a stopped status, right-click the cluster name and choose Delete Cluster, as shown in Figure 31.23.

Figure 31.23 
Deleting a cluster.


Summary and Best Practices

Windows Server 2003 clustering services enable organizations to create system-level fault tolerance and provide high availability for mission-critical applications and services. Although Cluster Service and network load balancing are each characteristically different and are best deployed on very different types of applications, between them they can increase fault tolerance for almost any application.

Best Practices

  • Purchase quality server and network hardware to build a fault-tolerant system. The proper configuration of this hardware is equally important.

  • Create disk subsystem redundancy using RAID.

  • Don't attempt to run both MSCS and NLB on the same computer; Microsoft does not support this configuration because of potential hardware-sharing conflicts.

  • Use cluster-aware applications so that Cluster Service can monitor the application. A cluster-unaware application can run on a cluster, but the application itself is not monitored by Cluster Service.

  • Use active/passive clustering mode except in cases where performance is critical. Active/passive mode is easier to manage and maintain, and the licensing costs are generally lower.

  • Use NLB to provide connectivity to TCP/IP-based services such as Terminal services, Web sites, VPN services, and streaming media services.

  • Use Windows Server 2003 Cluster Services to provide server failover functionality for mission-critical applications such as enterprise messaging, databases, and file and print services.

  • Disable power management on each of the cluster nodes both in the motherboard BIOS and in the Power applet in the operating system's Control Panel to avoid unwanted failover.

  • Carefully choose whether to use a shared disk or a nonshared approach to clustering.

  • Always purchase one additional node when planning for an MNS cluster.

  • Be sure that both Microsoft and the software manufacturer certify that third-party software packages for Cluster Service will work on a Windows Server 2003 cluster; otherwise, support will be limited when troubleshooting is necessary.

  • Use multiple network cards in each node so that one card can be dedicated to internal cluster communication (private network) while the other can be used only for client connectivity (public network) or for both public and private communication (mixed network).

  • Configure the failback schedule to allow failback only during non-peak times or after hours to reduce the chance of having a group failing back to a node during regular business hours after a failure.

  • Thoroughly test failover and failback mechanisms.

  • Do not change the Cluster Service account password using the Active Directory Users and Computers snap-in or the Windows security box if logged in with that account.

  • Be sure that a majority of the nodes remain running to keep the cluster in a working state if you're removing a node from an MNS cluster.

  • Carefully consider backing up and restoring a cluster.

  • Perform ASR backups periodically and immediately following any hardware changes to a cluster node including changes on a shared storage device or local disk configuration.

  • Thoroughly understand the application that will be used before determining which clustering technology to use.

  • Create a port rule that allows only specific ports to the clustered IP address and an additional rule blocking all other ports and ranges.

  • Employ tools such as Robocopy.exe, which is located in the Windows Server 2003 Resource Kit, or Application Center to replicate data between NLB nodes.
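
The majority rule behind the MNS recommendations above can be checked with a few lines of arithmetic. The sketch assumes the usual definition of a majority, more than half of the configured nodes; the function names are hypothetical.

```python
def majority(total_nodes):
    """Smallest number of running nodes that is more than half of the cluster."""
    return total_nodes // 2 + 1


def has_quorum(total_nodes, running_nodes):
    """True if enough nodes are running to keep an MNS cluster working."""
    return running_nodes >= majority(total_nodes)


def tolerable_failures(total_nodes):
    """How many nodes can be offline while a majority still remains."""
    return total_nodes - majority(total_nodes)


if __name__ == "__main__":
    for total in (2, 3, 4, 5):
        print(total, majority(total), tolerable_failures(total))
```

A two-node MNS cluster cannot lose any node and keep a majority, while a three-node cluster can lose one, which is the reasoning behind planning for one additional node.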
