How many of you have experienced Oracle rebooting nodes in your cluster?
Does anyone know why Oracle does this?
What is fencing?
Why do we fence?
How do we fence?
There are a number of questions or variants of questions that seem to come up regularly related to RAC and Oracle Clusterware, the way these products actually function, and why they behave the way they do. So over the next hour, we are going to lay these questions on the table, and do our best to de-mystify the behaviour.
In the context of clusterware, we hear questions like – what is the difference between CSS and CRS, or what is the role of the voting disk, and why do I need 3 of them, why does the clusterware reboot nodes, or what is the function of the VIP.
In the context of RAC, we hear lots of questions related to the physical interconnect set-up, sizing concerns for the interconnect and how does the addition of nodes affect the interconnect, how do we do load balancing, what exactly is ONS, etc.
So the agenda items are the specific questions we will delve into. As we get into it, you will undoubtedly have related or tangential questions that occur to you. Please make a note of them, and we will do our best to get to them in the Q&A section.
This graphic shows the major components of a RAC cluster. As you can see, each server node has its own operating system and Oracle binaries, including the necessary Oracle clusterware, and runs one or more Oracle instances connected to a single common database on the shared storage.
It’s important to note that as of 10g, Oracle provides all the necessary clusterware to manage cluster membership and inter-node communication. No other software is required. With Oracle providing all this functionality, end users are assured of optimal integration with all supporting database features. It also significantly simplifies support issues.
Availability in the architecture is increased with the addition of cluster nodes. Three-node clusters are common where customers require high availability. In the event of a node failure, the RAC clusterware determines the new cluster membership and notifies the database of the change. Surviving nodes continue to process transactions while beginning on-line recovery of the failed node’s transactions from the Redo Logs on the shared storage. (No information needs to be recovered from the lost node.) The time it takes to complete recovery is a function of the activity at the time of failure and tuning parameters set by the Database Administrator. From the application side, users can be automatically re-connected to another node in the cluster by implementing Oracle’s Fast Application Notification (FAN) and Fast Connection Failover (FCF).
Let’s take a closer look at the clusterware …
a) Event Management: Publish/subscribe Event forwarding mechanism
Handles the communication of critical cluster events, such as cluster state changes, e.g. node/instance join/leave.
b) High Availability Framework
Allows for creation and management of cluster resources, such as databases, listeners, application components
c) Process Monitor:
IO fencing functionality; prevents corruptions that can result if nodes/instances attempt to function in an uncoordinated way.
d) Group Membership
Provides the heartbeat functionality for keeping track of the health (Membership) of nodes and instances in the cluster.
e) VIP:
Predefined resource for enabling faster client detection (and failover) of public network failures.
a) evmd: Event forwarding mechanism
Handles the distribution of event packages to subscribing processes; communicates with its local and remote clients via ONS interfaces
Evmlogger, on demand, spawns RACGEVT children; RACGEVT scans the callout directory and invokes the callouts.
Callouts are specific actions that can be configured to take place on specific events; they are configured by placing scripts in the crs_home/racg/usrco directory.
b) crsd: cluster ready services – resource start/stop/disable/configure etc.
Engine for HA operation – monitors the “resources” that are defined to be part of the cluster.
Manages ‘application resources’
Starts, stops, checks and fails them over
Generates events when things happen
Spawns separate ‘actions’ to start/stop/check application resources
Maintains configuration profiles in OCR.
Stores current known state in OCR.
Crsd calls racgwrap [a shell script] that calls racgmain to start/stop/check resources; Provides process for OCR caching (“ocrd”); Runs as root;
Restarted automatically on failure
c) Process Monitor: oprocd or hangcheck timer: Our solution to Cluster I/O Fencing (corruption prevention).
Locked in memory, real time; Sleep a fixed time; If wake up time is too much later, reset processor & reboot; failure causes reboot; runs as root
Different implementations on different platforms. (e.g. hangchecktimer on Linux)
d) ocssd: CSS: Cluster synchronization services [heartbeat, reconfigs…]
The Foundation for inter-process interaction allowing coordinated activities across nodes;
Provides three main Client services
a) Group services – extension of previous skgxn
b) Node information (cluster configuration) services – Node monitoring to detect nodes joining and leaving cluster
c) Lock services – shared and exclusive locks used by crsd to synch operations across all cluster nodes
e) gsd: group services daemon
Not used in 10g, but is there when you do a ps; formerly used to process srvctl commands, but is now integrated in crsd.
It continues to be present to support 9i clients for srvctl
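The process monitor behaviour described under item c) above (sleep a fixed time, reboot if the wake-up is too late) can be sketched roughly as follows. This is only an illustration of the idea, not Oracle's implementation: the function name, the `interval_s` and `margin_s` parameters, and the returned strings are assumptions made for the example, and the real daemon resets the machine rather than returning a value.

```python
import time

def hang_check(interval_s, margin_s, sleep=time.sleep, clock=time.monotonic, cycles=3):
    """Sleep for a fixed interval and report a hang if the wake-up is too late.

    interval_s and margin_s are hypothetical names for the sleep time and the
    tolerated lateness; a real process monitor would reset the node instead
    of returning a string.
    """
    for _ in range(cycles):
        start = clock()
        sleep(interval_s)
        late_by = clock() - start - interval_s   # how far past the expected wake-up we are
        if late_by > margin_s:
            return "reboot"                      # node was stalled too long to be trusted
    return "ok"
```

Because the clock and sleep functions are injectable, the stall case can be simulated without actually hanging the machine.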
As mentioned in the previous section, group membership and the cluster heartbeat functionality are provided by CSS.
The heartbeat mechanisms are the foundation for determining who is, and who is not, part of the cluster. If the heartbeat fails, the offending node is reconfigured (evicted) out of the cluster.
Note: Node evictions are part of the design. They are a symptom of an underlying problem. They are not the problem itself. There are many different scenarios that can result in the eviction protocol being invoked, but they all synthesize to network and/or IO problems. Learning how to accurately diagnose these issues and isolate the problem is a critical success factor.
Disk heartbeat
Each node writes a disk heartbeat to each voting disk once per second
Each node reads its kill block once per second; if the kill block has been overwritten, the node commits suicide.
During reconfig (join or leave) CSSD monitors all nodes and determines whether a node has a disk heartbeat, including those with no network heartbeat.
If no disk heartbeat within I/O timeout (MissCount during cluster reconfiguration) then node is declared as dead.
The voting disk needs to be mirrored; should it become unavailable, the cluster will come down.
If an I/O error is reported immediately on access to the vote disk, we immediately mark the vote disk as offline so it isn't at the party anymore. So now we have (in our case) just two voting disks available. We do keep retrying access to that dead disk, and if it becomes available again and the data is uncorrupted we mark it online again. If a second vote disk suffers an I/O error in the window during which the first disk is marked offline, we no longer have quorum. Bang: reboot.
Back up the voting disks after initial cluster creation, and each time the number of nodes in the cluster is changed (add node, delete node). The reason is that the voting disk holds information about all nodes in the cluster; it isn’t a plain bitmap.
Automatic backup of the voting disk, similar to OCR, will be available in a future release.
Multiple voting disks [ redundancy] were introduced in 10gR2.
You can see from the above example CSS log file entries ….
In the case of the Network Heartbeat, Node 4 missed 59 check-ins, and then the eviction protocol was initiated.
And in the case of the disk heartbeat, the Disk Ping Monitor Thread (DPMT) starts spitting out warnings after approx. 45 seconds of trying to access the voting disk.
Problems with access to a single vote disk will result in a Warning as opposed to an Error, as long as there are still a majority of vote disks accessible from each node, so the warning is non-fatal, and in this case, eviction protocol is not invoked.
The 45 seconds is an internal calculation based on misscount (in 10.2.0.1 and earlier)
The STONITH algorithm is essentially an interface for remotely powering down a node in the cluster. The idea is quite simple: when the software running on one machine wants to make sure another machine in the cluster is not using a resource, pull the plug on the other machine. The idea is simple and reliable, albeit admittedly brutal. The advantage of STONITH is that it has no specific hardware requirements, and does not limit scalability, as is the case with other fencing mechanisms like SCSI3 PGR.
Voting files are used by CSS to ensure data integrity of the database by detecting and resolving network problems that could lead to a split-brain, so must be accessible at all times. There are other techniques used by other cluster managers, like quorum server, and quorum disks which function differently, but serve the same purpose.
Note that a majority of vote disks, i.e. N/2 + 1 (using integer division), must be accessible by each node to ensure that every pair of nodes has at least one voting file that they both see, which allows proper resolution of network issues; this is to address the possible complaint that 2 voting files provide redundancy, so a third should not be necessary.
During normal processing, each node writes a disk heartbeat once per second and also reads its kill block once per second. When the kill block indicates that the node has been evicted, the node exits, causing a node reboot.
As long as we have enough voting disks online, the node can survive, but when the number of offline voting disks is greater than or equal to the number of online voting disks, the Cluster Synchronization Services daemon will fail, resulting in a reboot. The rationale for this is that as long as each node is required to have a majority of voting disks online, there is guaranteed to be one voting disk that both nodes in a 2 node pair can see.
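The majority rule above can be checked with a few lines of Python. This is a minimal sketch of the arithmetic only; the function names are made up for the illustration and nothing here reflects how CSS is actually coded. The second function brute-forces the overlap claim: any two majority subsets of the voting disks share at least one disk.

```python
from itertools import combinations

def css_survives(online, total):
    """A node survives only while strictly more voting disks are online than offline."""
    offline = total - online
    return online > offline

def majorities_always_overlap(total):
    """Check that any two majority subsets of `total` voting disks share a disk."""
    need = total // 2 + 1                      # the N/2 + 1 majority
    majorities = [set(c)
                  for k in range(need, total + 1)
                  for c in combinations(range(total), k)]
    return all(a & b for a in majorities for b in majorities)
```

With three disks a node tolerates the loss of one (`css_survives(2, 3)` is true). With two disks it tolerates none (`css_survives(1, 2)` is false), which is why a second voting disk alone adds no fault tolerance.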
Note: NFS support for 3rd voting disk on 10.2.0.2 Linux only
There have been a couple of significant changes to the disk heartbeat behaviour in the 10g life cycle …
MissCount is the maximum time, in seconds, that a cluster heartbeat (messages sent between nodes over the network interconnect or through voting disk; the prime indicator of connectivity), can be missed before entering into a cluster reconfiguration to evict the node.
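As a rough illustration of the MissCount definition, the sketch below flags nodes whose last network heartbeat is older than the threshold. The function name, node names, and the value 60 used in the usage note are purely illustrative assumptions; the real CSS daemon also consults the disk heartbeat before evicting anyone.

```python
def nodes_to_evict(last_heartbeat, now, misscount):
    """Return the nodes whose last heartbeat is more than `misscount` seconds old.

    last_heartbeat maps a (hypothetical) node name to the time, in seconds,
    at which its most recent heartbeat was seen.
    """
    return sorted(node for node, seen in last_heartbeat.items()
                  if now - seen > misscount)
```

For example, with `last_heartbeat = {"node1": 100, "node4": 40}`, `now = 101` and an assumed misscount of 60, only `node4` is returned, because its last heartbeat is 61 seconds old.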
Disktimeout deals with blocked/hanging I/Os, as opposed to I/O errors (where an explicit I/O error is returned). If an I/O error is returned, the voting disk is immediately offlined, and as long as a majority of vote disks remain, the cluster will continue to function. In the background, we will continue looping to bring the vote disk back online.
Look up note that has instructions for changing misscount
This is the only supported method. Not following this method risks evictions and/or OCR corruption.
shut down CRS on all nodes but one
as root, run crsctl on that remaining node: $CRS_HOME/bin/crsctl set css misscount
reboot the remaining node
restart all other nodes
Confirm the new css misscount setting via ocrdump
First, it is important to note that GigE is the de facto standard for RAC cluster interconnects. There are a few customers using other interconnects (Infiniband is slowly gaining some interest), but the vast majority are using GigE. And in fact, the vast majority are using a single GigE, which means the bandwidth and latency associated with a single GigE is more than sufficient to meet the demands of the workload.
Having said that, how do we handle scalability (and failure) of the interconnect?
Let’s take a closer look at the interconnect fabric. It is a network, like any other, for a single dedicated specific (private) purpose – cluster communication. This shows what is a fairly typical resilient interconnect configuration …. Redundant NICS, redundant switches.
People can get creative with their network design, with use of VLANS, various switches, and different topologies. Latency is the key factor to minimize, along with reliability of the switches for cluster communications.
Once you have a resilient network infrastructure, the upper layer software stacks must somehow be made resilient to failures in the network components.
Private for performance and stability
Need to maintain bandwidth exclusive to keep variation low.
Dual-ported or multiple NICs are good to have for failover, but rarely needed for performance, as the utilized bandwidth empirically is lower than the total capacity of a GbE link.
For data shipping in OLTP and DSS, larger MTUs are more efficient, because they reduce interrupt load, save CPU, avoid fragmentation and therefore the probability of “losing blocks” if a fragment is dropped due to congestion control, buffer overflows in switches or similar incidents related to the functioning of IPC and networks.
Jumbo frames need to be supported by drivers, NICs and switches. They usually require a certain amount of additional configuration.
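To make the fragmentation point concrete, the sketch below counts the IPv4 fragments needed to ship one block in a single UDP datagram at a given MTU. It assumes an 8 KB database block, plain IPv4 with a 20-byte header and no options, and it ignores any Oracle message header, so treat the numbers as an approximation rather than a measurement.

```python
import math

IP_HEADER = 20   # bytes, IPv4 header without options (assumption)
UDP_HEADER = 8   # bytes

def ip_fragments(payload_bytes, mtu):
    """Number of IPv4 fragments needed to carry one UDP datagram of this payload."""
    datagram = payload_bytes + UDP_HEADER      # what the IP layer has to carry
    last = mtu - IP_HEADER                     # capacity of the final (or only) fragment
    full = last // 8 * 8                       # non-final fragments carry a multiple of 8 bytes
    if datagram <= last:
        return 1                               # fits in one packet, no fragmentation
    return 1 + math.ceil((datagram - last) / full)
```

Under these assumptions an 8192-byte block needs six fragments at an MTU of 1500 and a single packet at an MTU of 9000, and losing any one of the six fragments loses the whole block.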
In most known configurations to date, the bandwidth of 1 GbE is sufficient. The actual utilization depends on the size of the cluster nodes in terms of CPU power, the number of nodes accessing the same data, the size of the working set for an application. Most applications have good cache locality, and there are no increasing interconnect requirements when scaling the application out by adding cluster nodes and distributing the work over more instances or adding additional load. For small working sets which could fit into a small percentage of the available global buffer cache, the interconnect traffic may increase when the set remains constant.
The actual utilization is difficult to predict but in most cases is no reason for concern when it comes to providing adequate bandwidth. Typical utilizations for OLTP are usually much lower than the total available network capacity of 1 GbE.
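A back-of-the-envelope estimate shows why 1 GbE is usually enough. The helper below is a sketch only: the 8 KB block size and the block transfer rate are assumed inputs that you would take from your own workload statistics, and protocol overhead and message traffic are ignored.

```python
def interconnect_utilization(blocks_per_sec, block_bytes=8192,
                             link_bits_per_sec=1_000_000_000):
    """Fraction of the link capacity consumed by block transfers (payload only)."""
    return blocks_per_sec * block_bytes * 8 / link_bits_per_sec
```

For instance, a hypothetical 2,000 block transfers per second of 8 KB each works out to about 13% of a 1 GbE link.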
For DSS queries which use inter-instance communication between slaves, the size of the database and the distribution of work between query slaves may require multiple GbE NICs.
For OLTP, a general rule is that if the number of CPUs in a cluster node exceeds 16 – 20 CPUs, multiple NICs may be required and are advisable. Moreover, the larger the nodes in terms of memory and CPU power, the more important it is to keep them available and make NICs redundant.
It is recommended to check and test the network infrastructure and protocol stack configuration thoroughly before committing a system to production. Specifically, check socket buffer sizes, NIC data buffer and queue length sizes, the negotiated bit rate and duplex mode for NICs and switch ports, and flow control settings.
For Jumbo frames, consult with the hardware vendor as to the optimal setting, because the NIC and driver resources may have to be increased.
In some cases, network interrupts are handled by a dedicated CPU. If that CPU becomes 100% busy, performance will suffer and the IPC will not scale.
While the cluster verification utility automates some of these checks, it is advisable to thoroughly test the hardware and OS configuration with non-Oracle tools, such as netperf, iperf and other publicly available software.
Because the interconnect is typically a standard GigE, standard, platform-specific functionality can be used to provide failover and load balancing capabilities.
Different implementations:
Most solutions provide both failover and load balancing capability, but some only provide one of these. The number of physical network cards that can be bonded also varies. Most, but not all, support the IEEE 802.3ad Link Aggregation standard. Some solutions have additional functionality that enables physical links within a logical link to be connected to different switches, and this can be important for the overall network design. And some require compatible hardware across all NICs and switches.
Historically, vendor cluster managers have included their own proprietary low latency protocols, which required their own specific failover and load balancing capabilities. Oracle Clusterware uses standard GigE and TCP/UDP …. So readily available Layer 2 bonding functionality works fine
Include OCR Keys for CRS traffic
multiple keys for interconnect?
Select * from V$CLUSTER_INTERCONNECTS
Show where RAC traffic goes, where CRS heartbeat goes, show how address specification is a bonded address, as opposed to the real IP.
The init.ora parameter CLUSTER_INTERCONNECTS overrides the Clusterware interconnect for the database in which it is defined. Note that you can define many cluster_interconnects for a single database, but that doesn’t make it redundant. You still need to make the interface redundant with some type of bonding or teaming. The clusterware interconnect is the interface that is defined clusterwide as the interconnect. All databases that don’t have a cluster_interconnects parameter will use this interface. It is also used as the CSS heartbeat interface.
It is recommended to configure the private names on the same network that is classified as private with oifcfg or specified in CLUSTER_INTERCONNECTS. If the database uses a different network than the clusterware heartbeats, then network outage detection can take much longer.
bond0 Link encap:Ethernet HWaddr 00:04:23:B3:C5:05
inet addr:192.168.23.20 Bcast:192.168.255.255 Mask:255.255.0.0
UP BROADCAST RUNNING MASTER MULTICAST MTU:1500 Metric:1
RX packets:1205411285 errors:18775 dropped:18775 overruns:696 frame:0
TX packets:1216540252 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:0
RX bytes:2793727107 (2664.3 Mb) TX bytes:4193633924 (3999.3 Mb)
eth3 Link encap:Ethernet HWaddr 00:04:23:B3:C5:05
inet addr:192.168.23.20 Bcast:192.168.255.255 Mask:255.255.0.0
UP BROADCAST RUNNING SLAVE MULTICAST MTU:1500 Metric:1
RX packets:610160132 errors:6017 dropped:6017 overruns:123 frame:0
TX packets:608270126 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:1000
RX bytes:3539450021 (3375.4 Mb) TX bytes:4234517128 (4038.3 Mb)
Base address:0x9c80 Memory:ee840000-ee860000
eth5 Link encap:Ethernet HWaddr 00:04:23:B3:C5:05
inet addr:192.168.23.20 Bcast:192.168.255.255 Mask:255.255.0.0
UP BROADCAST RUNNING SLAVE MULTICAST MTU:1500 Metric:1
RX packets:595251153 errors:12758 dropped:12758 overruns:573 frame:0
TX packets:608270126 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:1000
RX bytes:3549244382 (3384.8 Mb) TX bytes:4254084092 (4057.0 Mb)
Base address:0x9c00 Memory:ee800000-ee820000
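When reading ifconfig output like the above, the RX error and drop counters are easier to judge as a ratio than as raw numbers. The sketch below parses one counter line in the format shown and returns errors as a fraction of packets received. The function name is made up, and the regular expression assumes this older net-tools layout, so it would need adjusting for other ifconfig versions.

```python
import re

# RX counter line copied from the bond0 section of the output above
SAMPLE = "RX packets:1205411285 errors:18775 dropped:18775 overruns:696 frame:0"

def rx_error_rate(line):
    """Return RX errors as a fraction of RX packets from one ifconfig counter line."""
    m = re.search(r"RX packets:(\d+) errors:(\d+)", line)
    if m is None:
        raise ValueError("not an ifconfig RX counter line")
    packets, errors = int(m.group(1)), int(m.group(2))
    return errors / packets if packets else 0.0
```

For the bond0 sample this comes out at roughly 1.5 errors per 100,000 packets. Whether that is acceptable depends on the environment, but non-zero errors and overruns on a private interconnect are usually worth investigating.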
As seen earlier, a lot of cycles for block access are actually spent in the OS on process wakeup and scheduling as well as network stack processing. The LMSs or block server processes are a crucial component. They should always be scheduled immediately when they need to run. On a very busy system with many concurrent processes, the system load may have an impact on how predictably LMS can be scheduled.
The default number of LMS processes is based on the number of available CPUs, and the goal is to minimize their number to keep individual LMS processes busy. Fewer LMS processes have the additional advantage of allowing for better message aggregation and therefore more CPU-efficient processing.
If the VIP is not in DNS, then it should at least be in the /etc/hosts file on all nodes in the cluster and on any client node accessing the cluster.
Note that the installer will not accept a VIP in the following private address ranges:
10.0.0.0 – 10.255.255.255 (10/8 prefix)
172.16.0.0 – 172.31.255.255 (172.16/12 prefix)
192.168.0.0 – 192.168.255.255 (192.168/16 prefix)
However VIPCA will let you configure them post install (just let VIPCA part of the install fail)
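A quick way to tell in advance whether a planned VIP falls into one of the three ranges listed above is Python's standard `ipaddress` module. This is a convenience sketch; the function name is an assumption and the check only mirrors the ranges quoted here, not the installer's actual validation logic.

```python
import ipaddress

# The three private ranges listed above, in prefix notation
PRIVATE_RANGES = [ipaddress.ip_network(n)
                  for n in ("10.0.0.0/8", "172.16.0.0/12", "192.168.0.0/16")]

def in_listed_private_range(addr):
    """True if the address falls inside one of the ranges listed above."""
    ip = ipaddress.ip_address(addr)
    return any(ip in net for net in PRIVATE_RANGES)
```

For example, `in_listed_private_range("192.168.23.20")` is true, while `in_listed_private_range("172.32.0.1")` is false because 172.32.0.1 lies just outside the 172.16/12 prefix.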
Our Virtual IP works differently from VIPs or relocatable IPs in other cluster software. The RAC VIP is architected for RAC, where we have active instances on each node in the cluster. The RAC VIP will only accept connections when it is active on its home node. When a failure occurs on that node (i.e. node down or public network interface down), the Oracle Clusterware will relocate the VIP to another node in the cluster. When it is on any node other than its home node, it will not be active, but it will respond to connection requests with a silent error. The client should then fail over immediately to the next address in the list. What we do not do is have the listener on the node that it is relocated to open a port (default is 1521), so the error the client SQL*Net layer gets back is as if the listener is not started.
To direct clients to the nodes in the cluster, you are recommended to use Services. Services can be active on one to many nodes in a cluster as defined by the DBA. Services are dynamic and can be changed at any time.
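The client-side effect of a relocated VIP can be pictured with a small simulation. This is not SQL*Net code: the function, the VIP names, and the set standing in for "addresses with a listener" are all hypothetical. The point is only that a relocated VIP answers with an immediate error instead of a long TCP timeout, so the client moves straight on to the next address.

```python
def connect_first_available(addresses, listening):
    """Try each address in order and return the first one that has a listener.

    `listening` is a set standing in for the addresses on which a listener
    accepts connections; a relocated VIP is simply absent from it.
    Returns (chosen_address_or_None, list_of_addresses_tried).
    """
    attempts = []
    for addr in addresses:
        attempts.append(addr)
        if addr in listening:
            return addr, attempts
    return None, attempts
```

With `addresses = ["node1-vip", "node2-vip"]` and only `node2-vip` listening, the client tries `node1-vip`, gets the immediate error, and connects to `node2-vip` on the second attempt.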
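© 2010, www.oracledatabase12g.com. All rights reserved. This article may be reposted, provided the source address is credited with a link; otherwise legal action will be pursued.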