The Design and Architecture of the Microsoft Cluster Service (MSCS) - W. Vogels et al.
ECE 845 Presentation
By Sandeep Tamboli
April 18, 2000

Outline
- Prerequisites
- Introduction
- Design Goals
- Cluster Abstractions
- Cluster Operation
- Cluster Architecture
- Implementation Examples
- Summary

Prerequisites
- Availability = MTTF / (MTTF + MTTR)  (worked example below)
  - MTTF: Mean Time To Failure
  - MTTR: Mean Time To Repair
- High Availability:
  - A Modern Taxonomy of High Availability: a system that has sufficient redundancy in components to mask certain defined faults has High Availability (HA).
  - IBM High Availability Services: the goals of high availability solutions are to minimize both the number of service interruptions and the time needed to recover when an outage does occur. High availability is not a specific technology nor a quantifiable attribute; it is a goal to be reached. This goal is different for each system and is based on the specific needs of the business the system supports.
  - The presenter: the system may run with degraded performance while a component is down.
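A quick worked example of the availability formula (hypothetical numbers, not from the presentation), in Python. It also shows why shrinking MTTR is the main lever for HA: failover restarts a service in minutes instead of waiting hours for a repair.

```python
# Illustrative (hypothetical numbers): availability from MTTF and MTTR.
def availability(mttf_hours: float, mttr_hours: float) -> float:
    """Fraction of time the system is up: MTTF / (MTTF + MTTR)."""
    return mttf_hours / (mttf_hours + mttr_hours)

# A node failing every 1000 hours and taking 2 hours to repair is ~99.8%
# available; cutting recovery time to 6 minutes pushes it to ~99.99%.
print(availability(1000, 2.0))   # 0.998003...
print(availability(1000, 0.1))   # 0.999900...
```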
MSCS (a.k.a. Wolfpack)
- Extension of Windows NT to improve availability
- First phase of implementation: scalability limited to 2 nodes
- MSCS features:
  - Failover
  - Migration
  - Automated restart
- Differences from previous HA solutions:
  - Simpler user interface
  - More sophisticated modeling of applications
  - Tighter integration with the OS (NT)

MSCS (2)
- Shared-nothing cluster model:
  - Each node owns a subset of the cluster resources
  - Only one node may own a resource at a time
  - On failure, another node may take over ownership of the resource

Design Goals
- Commodity
  - Commercial off-the-shelf nodes
  - Windows NT Server
  - Standard Internet protocols
- Scalability
- Transparency
  - Presented as a single system to the clients
  - System management tools manage the cluster as if it were a single server
  - Service and system execution information available in a single cluster-wide log

Design Goals (2)
- Availability
  - On failure detection:
    - Restart the application on another node
    - Migrate ownership of other resources
  - The restart policy can specify the availability requirements of the application
  - Hardware/software upgrades possible in a phased manner
Cluster Abstractions
- Node: runs an instance of the Cluster Service
  - Defined and active
- Resource: functionality offered at a node
  - Physical: printer
  - Logical: IP address
  - Applications implement logical resources: Exchange mail database, SAP applications
- Quorum Resource
  - Persistent storage for the Cluster Configuration Database
  - Arbitration mechanism to control membership
  - Partition on a fault-tolerant shared SCSI disk

Cluster Abstractions (2)
- Resource Dependencies
  - Dependency trees: the sequence in which to bring resources online (see the sketch after this slide)
- Resource Groups
  - Unit of migration
- Virtual servers
  - Application runs within a virtual server environment
  - Illusion, to applications, administrators, and clients, of a single stable environment
  - Client connects using the virtual server name
  - Enables many application instances to run on the same physical node
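A dependency tree means a resource is started only after everything it depends on is online. A minimal sketch of that ordering (hypothetical resource names; inside MSCS this is the resource manager's job):

```python
# Illustrative sketch (not MSCS code): bring resources online in dependency
# order, e.g. a file share depends on a disk and a network name, and the
# network name depends on an IP address.
from graphlib import TopologicalSorter  # Python 3.9+

# resource -> set of resources it depends on (hypothetical group)
depends_on = {
    "ip-address": set(),
    "physical-disk": set(),
    "network-name": {"ip-address"},
    "file-share": {"physical-disk", "network-name"},
}

for resource in TopologicalSorter(depends_on).static_order():
    print("bring online:", resource)
# ip-address and physical-disk first, then network-name, then file-share
```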
Cluster Abstractions (3)
- Cluster Configuration Database
  - Replicated at each node
  - Accessed through the NT registry
  - Updates applied using the Global Update Protocol

Cluster Membership Operation
[State diagram of the node membership state machine. The key distinguishes externally visible states (Offline, Joining, Paused, Online) from internal states (Initializing, Member Search, Quorum Disk Search, Forming, Sleeping, Rundown). Transition labels include: Start Cluster, Cluster Service started / Cluster Service fails, Join succeeds / Join fails, Member Search fails, Quorum Disk found / search fails, Synchronize succeeds, timeout / retries exceeded, complete, Pause / Resume, Evict or Leave Cluster, Shutdown System, and Stop Cluster Service.]

Member Join (a sketch follows this slide)
- The sponsor broadcasts the identity of the joining node
- The sponsor informs the joining node about:
  - The current membership
  - The cluster configuration database
- The joining member's heartbeats start
- The sponsor waits for the first heartbeat
- The sponsor signals the other nodes to consider the joining node a full member
- An acknowledgement is sent to the joining node
- On failure, the join operation is aborted and the joining node is removed from the membership
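The join sequence maps onto a small sponsor-side routine. A minimal sketch, assuming hypothetical helpers (broadcast, send, wait_for_first_heartbeat) rather than the real Cluster Service interfaces:

```python
# Illustrative sketch (hypothetical helpers, not the MSCS API):
# the sponsor-side view of the member join protocol.
def sponsor_join(sponsor, joiner, active_nodes):
    sponsor.broadcast(active_nodes, msg=("JOINING", joiner.id))
    sponsor.send(joiner, sponsor.current_membership())
    sponsor.send(joiner, sponsor.configuration_database())
    try:
        sponsor.wait_for_first_heartbeat(joiner, timeout_s=5.0)
    except TimeoutError:
        # Join failed: abort and drop the joiner from the membership.
        sponsor.broadcast(active_nodes, msg=("JOIN_ABORTED", joiner.id))
        return False
    # Heartbeats observed: promote the joiner to full member everywhere.
    sponsor.broadcast(active_nodes, msg=("FULL_MEMBER", joiner.id))
    sponsor.send(joiner, ("JOIN_ACK", sponsor.current_membership()))
    return True
```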
Member Regroup
- Upon suspicion that an active node has failed, the member regroup operation is executed to detect any membership changes
- Reasons for suspicion:
  - Missing heartbeats
  - Power failures
- The regroup algorithm moves each node through 6 stages
- Each node sends periodic messages to all other nodes, indicating which stage it has finished: barrier synchronization (sketched after the next slide)

Regroup Algorithm
- Activate: after a local clock tick, each node sends and collects status messages; a node advances if all responses are collected or a timeout occurs
- Closing: it is determined whether partitions exist and whether the current node's partition should survive
- Pruning: all nodes that are pruned for lack of connectivity halt
- Cleanup phase one: all the surviving nodes
  - Install the new membership
  - Mark the halted nodes as inactive
  - Inform the cluster network manager to filter out the halted nodes' messages
  - Have the event manager invoke local callback handlers announcing the node failures
- Cleanup phase two: a second cleanup callback is invoked to allow a coordinated two-phase cleanup
- Stabilized: the regroup has finished
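A toy model of the barrier synchronization between stages (threads stand in for nodes, and a threading.Barrier stands in for the periodic stage-completion messages; not MSCS code):

```python
# Toy model: each "node" may enter stage k+1 only after every surviving
# node has reported finishing stage k - one barrier per stage.
import threading

STAGES = ["activate", "closing", "pruning", "cleanup1", "cleanup2", "stabilized"]
NODES = 3
barrier = threading.Barrier(NODES)  # stands in for the status messages

def node(node_id: int):
    for stage in STAGES:
        # ... do the local work of this stage ...
        print(f"node {node_id} finished {stage}")
        barrier.wait()  # advance only when all nodes report this stage done

threads = [threading.Thread(target=node, args=(i,)) for i in range(NODES)]
for t in threads: t.start()
for t in threads: t.join()
```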
Partition Survival
A partition survives if any of the following is satisfied (see the predicate sketch below):
- n(new membership) > 1/2 * n(original membership)
- The following three conditions are satisfied together:
  - n(new membership) = 1/2 * n(original membership)
  - n(new membership) >= 2
  - The tiebreaker node is in the new membership
- The following three conditions are satisfied together:
  - n(original membership) = 2
  - n(new membership) = 1
  - The quorum disk is owned by the new membership
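The survival rule as a predicate (an illustrative restatement of the slide, not MSCS code):

```python
# Illustrative sketch: the partition-survival rule from the slide.
def partition_survives(n_new: int, n_orig: int,
                       has_tiebreaker: bool, owns_quorum_disk: bool) -> bool:
    # A clear majority always survives.
    if n_new > n_orig / 2:
        return True
    # An exact half survives only with >= 2 nodes plus the tiebreaker node.
    if n_new == n_orig / 2 and n_new >= 2 and has_tiebreaker:
        return True
    # Special case for a 2-node cluster: the lone survivor must win
    # arbitration for the quorum disk.
    if n_orig == 2 and n_new == 1 and owns_quorum_disk:
        return True
    return False

# Example: in a 2-node cluster, the node holding the quorum disk survives.
print(partition_survives(1, 2, has_tiebreaker=False, owns_quorum_disk=True))  # True
```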
Resource Management
- A resource control DLL for each type of resource
- The polymorphic design allows easy management of varied resource types
- Resource state transition diagram (a code sketch follows):
[States: Offline, Online-pending, Online, Offline-pending, Failed. Transition labels: Request to online, Init complete, Init failed, Request to offline, Shutdown complete.]
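One plausible reading of that state machine as a transition table (the exact edge set is an assumption reconstructed from the slide's labels):

```python
# Illustrative sketch: resource state machine (assumed topology).
TRANSITIONS = {
    ("Offline",         "request_to_online"):  "Online-pending",
    ("Online-pending",  "init_complete"):      "Online",
    ("Online-pending",  "init_failed"):        "Failed",
    ("Online",          "request_to_offline"): "Offline-pending",
    ("Online-pending",  "request_to_offline"): "Offline-pending",
    ("Offline-pending", "shutdown_complete"):  "Offline",
}

def step(state: str, event: str) -> str:
    # Events that don't apply in the current state are ignored.
    return TRANSITIONS.get((state, event), state)

state = "Offline"
for event in ["request_to_online", "init_complete",
              "request_to_offline", "shutdown_complete"]:
    state = step(state, event)
    print(event, "->", state)
```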
Resource Migration: Pushing a Group
- Executed when:
  - A resource fails at the original node
  - The resource group prefers to execute at another node
  - The administrator moves the group
- Steps involved:
  - All resources are taken to the offline state
  - A new active host node is selected
  - The group is brought online at the new node

Resource Migration: Pulling a Group
- Executed when the original node fails
- Steps involved:
  - A new active host node is selected
  - The group is brought online at the new node
- Nodes can determine the new owner hosts without communicating with each other, with the help of the replicated cluster database (sketched below)
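Because every surviving node holds the same replicated configuration database and the same view of the new membership, each can evaluate the same deterministic placement rule and agree on the new owner without exchanging messages. A sketch (the preferred-owners rule here is an illustrative assumption):

```python
# Illustrative sketch: deterministic owner selection after a node failure.
# Every node runs this same function on identical replicated state, so all
# arrive at the same answer with no extra communication.
def new_owner(preferred_owners: list[str], surviving_nodes: set[str]) -> str | None:
    for node in preferred_owners:      # replicated, identically ordered list
        if node in surviving_nodes:    # same membership view on every node
            return node
    return None                        # no eligible host: group stays offline

group_preferred = ["node-a", "node-b", "node-c"]  # from the cluster database
print(new_owner(group_preferred, {"node-b", "node-c"}))  # node-b, everywhere
```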
Resource Migration: Fail-back
- No automatic migration to the preferred owner
- Constrained by the fail-back window:
  - How long the node must be up and running
  - Blackout periods
- Fail-back can be deferred for cost or availability reasons

Cluster Architecture
Components of the Cluster Service:

Component             | Functionality
----------------------|-----------------------------------------------------------
Event processor       | Provides the intra-component event delivery service
Object manager        | A simple object management system for the object collections in the Cluster Service
Node manager          | Controls the quorum Form and Join process, generates node failure notifications, and manages network and node objects
Membership manager    | Handles dynamic cluster membership changes
Global Update manager | A distributed atomic update service for the volatile global cluster state variables
Database manager      | Implements the Cluster Configuration Database
Checkpoint manager    | Stores the current state of a resource (in general, its registry entries) on persistent storage
Log manager           | Provides structured logging to persistent storage and a lightweight transaction mechanism
Resource manager      | Controls the configuration and state of resources and resource dependency trees; monitors active resources to see if they are still online
Failover manager      | Controls the placement of resource groups at cluster nodes; responds to configuration changes and failure notifications by migrating resource groups
Network manager       | Provides inter-node communication among cluster members
Global Update Management
- Atomic broadcast protocol: if one surviving member receives an update, all the surviving members eventually receive the update
- The locker node has a central role
- Steps in normal execution:
  1. A node wanting to start a global update contacts the locker
  2. When accepted by the locker, the sender RPCs to each active node to install the update, in node-ID order starting with the node immediately after the locker
  3. Once the global update is over, the sender sends the locker an unlock request to indicate successful termination

Failure Conditions
- If all the nodes that received the update fail, the update never occurred
- If the sender fails during the update operation:
  - The locker reconstructs the update and sends it to each active node
  - Nodes ignore the duplicate update
- If the sender and the locker both fail after the sender installed the update at any node beyond the locker:
  - The next node in the update list is assigned as the new locker
  - The new locker will complete the update (see the sketch below)
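A sketch of the normal path of the Global Update Protocol (hypothetical try_lock / install / unlock helpers, not MSCS code; the locker keeping a copy of the update at lock time is what lets it reconstruct the update if the sender fails):

```python
# Illustrative sketch: GUP normal execution - lock at the locker, install at
# each active node in node-ID order starting just after the locker, unlock.
def global_update(sender_id, locker, nodes, update):
    """nodes: dict mapping node_id -> node object for all active nodes."""
    # Step 1: contact the locker; it stores the update so it can
    # reconstruct and replay it if the sender later fails.
    if not locker.try_lock(sender_id, update):
        return False                       # another update is in progress
    ids = sorted(nodes)
    start = (ids.index(locker.node_id) + 1) % len(ids)
    order = ids[start:] + ids[:start]      # locker's successor comes first
    order.remove(locker.node_id)           # the locker already has the update
    for node_id in order:                  # step 2: install everywhere
        nodes[node_id].install(update)     # stands in for the RPC
    locker.unlock(sender_id)               # step 3: report success
    return True
```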
Support Components
- Cluster Network: extension to the basic OS
  - Heartbeat management
- Cluster Disk Driver: extension to the basic OS
  - Shared SCSI bus
- Cluster-wide Event Logging
  - Events sent via RPC to all other nodes (periodically)
- Time Service
  - Clock synchronization

Implementation Examples
- MS SQL Server
  - A SQL Server resource group configured as a virtual server
  - A 2-node cluster can have 2 or more HA SQL Servers
- Oracle servers
  - Oracle Parallel Server
    - Shared-disk model
    - Uses MSCS to track cluster organization and membership notifications
  - Oracle Fail-Safe server
    - Each instance of a Fail-Safe database is a virtual server
    - Upon failure: the virtual server migrates to the other node, and the clients reconnect under the same name and address

Implementation Examples (2)
- SAP R/3
  - Three-tier client/server system
  - Normal operation: one node hosts the database virtual server; the other provides the application components combined in a server
  - Upon failure:
    - The failed virtual server migrates to the surviving node
    - The application servers are failover-aware
    - Migration of the application server needs a new login session
Scalability Issues
- Join latency
- Regroup messages
- GUP latency
- GUP throughput

Summary
- A highly available 2-node cluster design using commodity components
- The cluster is managed in 3 tiers:
  - Cluster abstractions
  - Cluster operation
  - Cluster Service components (interaction with the OS)
- The design is not scalable beyond about 16 nodes

Relevant URLs
- A Modern Taxonomy of High Availability: interlog/resnick/HA.htm
- An Overview of Clustering in Windows NT Server 4.0, Enterprise Edition: microsoft/ntserver/ntserverenterprise/exec/overview/clustering.asp
- Scalability of MSCS: cs.cornell.edu/rdc/mscs/nt98/
- IBM High Availability Services: as.ibm/asus/highavail2.html
- High-Availability Linux Project: linux-ha.org/

Discussion Questions
- Is clustering the only choice for HA systems?
- Why is MSCS in use today despite its scalability concerns?
- Does performance suffer because of HA provisions? Why?
- Are geographical HA solutions needed (in order to take care of site disasters)? This is good for transaction-oriented services. What about, say, scientific computing?
- Hierarchical clustering?

Glossary
NetBIOS: Short for Network Basic Input Output System, an application programming interface (API) that augments the DOS BIOS by adding special functions for local-area networks (LANs). Almost all LANs for PCs are based on NetBIOS. Some LAN manufacturers have even extended it, adding additional network capabilities. NetBIOS relies on a message format called Server Message Block (SMB).
SMB: Short for Server Message Block, a message format used by DOS and Windows to share files, directories, and devices. NetBIOS is based on the SMB format, and many network products use SMB. These SMB-based networks include LAN Manager, Windows for Workgroups, Windows NT, and LanServer. There are also a number of products that use SMB to enable file sharing among different operating system platforms. A product called Samba, for example, enables UNIX and Windows machines to share directories and files.