A document from inside Oracle. I think this note has never been published on the Internet, so I am sharing it here:
CSS (Cluster Synchronization Services) Internals (INTERNAL ONLY)
PURPOSE
——-
The purpose of this document is to document the CSS internals.
SCOPE & APPLICATION
——————-
This document is INTERNAL ONLY and is intended for support and SHOULD NOT be given to customers.
CSS (CLUSTER SYNCHRONIZATION SERVICES) INTERNALS (INTERNAL ONLY)
================================================================
CSS is part of the CRS stack and provides client services, group management, and node monitoring. CSS is a multi-threaded application that runs as the oracle user. The primary components of the CSS daemon are the Node Monitor and the Group Manager. CSS uses network services via CLSC and Sqlnet for communication across nodes or with other processes on the same node. CSS will start up with the rest of the CRS stack on node startup. An example of the architecture in a 2-node configuration would be:
[Architecture diagram not reproduced]
Primary Functions of CSS:
Client library provides APIs for clients, e.g. RDBMS
- Group services
- Lock services
- Node information
The CSS Daemon provides infrastructure
- Group Manager handles group and lock synchronization among nodes
- Node Monitor monitors other nodes
Client Services
===============
Group Services (clssgs.c) – CSS provides group services by notifying clients (such as lmon) of cluster membership information and changes. When an instance joins the cluster it will join a group via GM. When applications connect to CSS they join their group and all members of the group share some private data such as IPC endpoints. The application will then use these IPC endpoints for communication. There is also some public data accessible to non-group members. The global data store (not persistent) is available for bootstrap and initial contact. The oldest node or the node with the lowest number in the cluster is considered the source of this bitmap although the data is distributed for recovery reasons.
Lock Services (clssls.c) – CSS also provides rudimentary lock services. It has functionality to protect resources with shared and exclusive locks. These lock services are not a replacement for the iDLM. Their performance characteristics are not as good as the iDLM's due to the nature of the locking. There are no lock escalation facilities, so going from null to exclusive and then back to shared is a release and re-acquire.
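To make the "release and re-acquire" point concrete, here is a minimal sketch of a shared/exclusive lock with no conversion primitive. The class and method names are hypothetical and are not the clssls.c API; the sketch only illustrates why moving from exclusive back to shared opens a window in which another client could take the lock.

```python
import threading


class SimpleLock:
    """Toy shared/exclusive lock with no in-place conversion (hypothetical API)."""

    def __init__(self):
        self._cond = threading.Condition()
        self._readers = 0
        self._writer = False

    def acquire(self, mode):
        with self._cond:
            if mode == "exclusive":
                # Exclusive needs no writer and no readers.
                self._cond.wait_for(lambda: not self._writer and self._readers == 0)
                self._writer = True
            else:
                # Shared only needs the writer to be gone.
                self._cond.wait_for(lambda: not self._writer)
                self._readers += 1

    def release(self, mode):
        with self._cond:
            if mode == "exclusive":
                self._writer = False
            else:
                self._readers -= 1
            self._cond.notify_all()

    def exclusive_to_shared(self):
        # No conversion facility: this is a plain release followed by a new
        # acquire, so another client may grab the lock in between.
        self.release("exclusive")
        self.acquire("shared")


lock = SimpleLock()
lock.acquire("exclusive")
lock.exclusive_to_shared()
lock.acquire("shared")  # a second shared holder is now allowed
print(lock._readers, lock._writer)
```

A lock manager with real conversion support would perform the downgrade atomically, which is one of the reasons the text says these services do not replace the iDLM.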
Node Information Services (clssns.c) – These services provide cluster configuration information. They provide static information such as node names, node numbers, etc. This data may change when nodes are added to or removed from the cluster.
CSS Daemon
==========
Group Management (clssgm.c and clssgm1.c) – Group Management (GM) manages group and lock services. One node serves as the GM master node. All nodes serialize GM requests through the master node. The master node broadcasts membership changes to all other nodes. Group membership is synchronized at each cluster reconfiguration. Each node interprets membership changes independently.
Node Monitoring (clssnm.c) – Node monitoring (NM) is used to verify the health of all members of the cluster. It will maintain consistency with vendor clusterware (if it exists) via skgxn. The node monitor will use the OCR to identify communications endpoints to listen on and connect to. Messages sent between the NMs are asynchronous in 10.1.0.3 and above and synchronous in 10.1.0.2, a change that reduced startup times from several minutes to seconds. NM also does a heartbeat via the network and via disk (voting file) every 1 second to ensure that all members of the cluster are alive and responding. If a node does not respond, NM will wait x seconds before evicting that member of the cluster, where x is the misscount setting. When a node is evicted it receives a poison packet via the network and disk (voting file). The evicted member(s) will do a fast reboot to remove themselves from the cluster to ensure the health of the surviving cluster members. For more information on CSS evictions and reboots see Note 265769.1. When a membership change occurs (for example a node is evicted or shut down), NM initiates a reconfiguration which allows multiple nodes to join or leave the cluster concurrently.
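The misscount rule described above can be sketched as a simple timestamp comparison. This is only an illustration: the NodeState type, the function name, and the 30-second misscount value are assumptions made for the example and are not taken from the CSS source.

```python
from dataclasses import dataclass


@dataclass
class NodeState:
    node_number: int
    last_heartbeat: float  # time in seconds when the last heartbeat was seen


def nodes_past_misscount(nodes, now, misscount_s):
    """Return the node numbers whose last heartbeat is older than misscount."""
    return sorted(
        n.node_number for n in nodes if now - n.last_heartbeat > misscount_s
    )


# Node 1 checked in one second ago; node 2 has been silent for 36 seconds.
nodes = [NodeState(1, last_heartbeat=100.0), NodeState(2, last_heartbeat=65.0)]
print(nodes_past_misscount(nodes, now=101.0, misscount_s=30.0))  # assumed misscount
```

In the real daemon the check covers both the network and the disk heartbeat; the sketch collapses these into a single timestamp per node.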
CSS Startup
===========
The init.cssd shell script is run from the inittab. It runs the $ORA_CRS_HOME/bin/ocssd shell script, which in turn runs the ocssd.bin executable. The ocssd.bin executable starts in clssscmain. From here it will enable CSS logging via the clssscLoggingThread and we will see the following messages in the $ORA_CRS_HOME/css/log/ocssd#.log:
2005-04-19 09:36:08.184 >USER: Oracle Database 10g CSS Release 10.1.0.3.0 Production Copyright 1996, 2004 Oracle. All rights reserved.
2005-04-19 09:36:08.184 >USER: CSS daemon log for node racbde1, number 1, in cluster crs
2005-04-19 09:36:08.227 [8192] >TRACE: clssscmain: local-only set to false
These messages indicate that clssscmain and clssscLoggingThread have been successfully started.
After this clssscmain will initialize these persistent threads:
--> clssnmClusterListener thread (via clssnmNMInitialize --> clssnmStartClusterListening)
--> clssnm_skgxnmon thread (via clssnmNMInitialize --> clssnmStartClusterListening)
    Note: This thread is only persistent if the customer is using vendor clusterware
--> clsc_authent_thrd (via clscinit)
--> clssgmclientlsnr thread (via clssgmInitCMAPIServices)
--> clssgmPeerListener thread (via clssgmInitNMMon and clssgmStartNMMon)
--> clssnmPollingThread (via clssgmStartNMMon --> clssnmNMAttach)
--> clssnmDiskPingThread (via clssgmStartNMMon --> clssnmNMAttach)
--> clssnmSendingThread (via clssgmStartNMMon --> clssnmNMAttach)
--> clssnmDiskPingMonitorThread (via clssgmStartNMMon --> clssnmNMAttach)
Because the clssgmPeerListener thread was the last thread spawned by the clssscmain thread, a stack trace of an idle clssscmain thread on a running CSS would look like this (Example on Linux):
Thread 1 (Thread 8192 (LWP 2098)):
#0 0x40dfe681 in __libc_nanosleep () from /lib/i686/libc.so.6
#1 0x40e2b3fb in usleep (useconds=1000000) at ../sysdeps/unix/sysv/linux/usleep.c:30
#2 0x400784be in sltrusleep () from /u01/app/oracle/product/10.1.0/crs/lib/libhasgen10.so
#3 0x08085b6a in clssgcsleep (millisec=1000) at clssgc.c:172
#4 0x0806e63c in clssgmStartNMMon (thrd=0x8113f1c, cmInfo=0x818db88) at clssgm.c:959
#5 0x08052bd7 in clssscmain (argc=1, argv=0xbffec084, envp=0xbffec08c) at clsssc.c:832
#6 0x08050909 in main (argc=1, argv=0xbffec084, envp=0xbffec08c) at s0clsssc.c:342
After clssscmain spawns its threads, the CSS log would show the following:
2005-04-19 09:36:08.275 [8192] >TRACE: clssnmReadNodeInfo: added node 1 (racbde1) to cluster
2005-04-19 09:36:08.286 [8192] >TRACE: clssnmReadNodeInfo: added node 2 (racbde2) to cluster
2005-04-19 09:36:08.290 [8192] >TRACE: clssnmVotingDevInit: quorum disk configured to be (/dev/raw/raw2)
2005-04-19 09:36:08.369 [8192] >TRACE: clssscFatalInit: fatal mode enabled
2005-04-19 09:36:08.370 [8192] >TRACE: clssnm_skgxnonline: Using vacuous skgxn monitor
2005-04-19 09:36:08.371 [16387] >TRACE: clsc_listen: (0x8227230) Listening on (ADDRESS=(PROTOCOL=tcp)(HOST=racbde1-priv)(PORT=49895))
2005-04-19 09:36:08.375 [24580] >TRACE: clsc_listen: (0x822a030) Listening on (ADDRESS=(PROTOCOL=ipc)(KEY=Oracle_CSS_LclLstnr_crs_1))
2005-04-19 09:36:08.376 [24580] >TRACE: clssgmclientlsnr: listening on (ADDRESS=(PROTOCOL=ipc)(KEY=Oracle_CSS_LclLstnr_crs_1))
What this shows is that the main thread has gone into Node Monitor code and read that there are 2 cluster members in the quorum disk and that the quorum disk lives on /dev/raw/raw2. The main thread then found that fatal mode has been enabled from scls. Then we found that no vendor clusterware is present so we are using the vacuous skgxn monitor. Once we see the clsc_listen entries, the clssnmClusterListener thread has been spawned and we are listening on racbde1-priv on port 49895. Netstat -a should confirm that this port is in use. The “clssgmclientlsnr” entry indicates that the clssgmclientlsnr thread has been successfully spawned.
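When checking the listening endpoint against netstat output, it can help to pull the protocol, host, and port out of the clsc_listen entry programmatically. The following is a small sketch; the function name parse_endpoint is made up for this example, and the regular expression is based only on the sample entries shown above.

```python
import re

# Matches the innermost (NAME=value) pairs of an ADDRESS string.
_PAIR = re.compile(r"\((\w+)=([^()]+)\)")


def parse_endpoint(text):
    """Return the KEY=value pairs found in an ADDRESS string as a dict."""
    return {key.upper(): value for key, value in _PAIR.findall(text)}


line = ("2005-04-19 09:36:08.371 [16387] >TRACE: clsc_listen: (0x8227230) "
        "Listening on (ADDRESS=(PROTOCOL=tcp)(HOST=racbde1-priv)(PORT=49895))")
endpoint = parse_endpoint(line)
print(endpoint["PROTOCOL"], endpoint["HOST"], int(endpoint["PORT"]))
```

The outer ADDRESS= wrapper and the (0x8227230) handle are skipped because neither is a plain NAME=value pair.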
Additional Info About Each Persistent Thread
============================================
clssscLoggingThread – This thread writes CSS trace buffers to the trace file. A stack trace of an idle clssscLoggingThread thread on a running CSS would look like this (Example on Linux):
Thread 3 (Thread 8194 (LWP 2107)):
#0 0x40d71be5 in __sigsuspend (set=0x413e3950) at ../sysdeps/unix/sysv/linux/sigsuspend.c:45
#1 0x40d052b9 in __pthread_wait_for_restart_signal (self=0x413e3be0) at pthread.c:1027
#2 0x40d01bdc in pthread_cond_wait (cond=0x8127e30, mutex=0x8127dc8) at restart.h:34
#3 0x4007b7f3 in sltspcwait () from /u01/app/oracle/product/10.1.0/crs/lib/libhasgen10.so
#4 0x080558dc in clssscLoggingThread (thrd=0x8130698) at clsssc.c:2013
#5 0x40d02c6f in pthread_start_thread (arg=0x413e3be0) at manager.c:279
#6 0x40e30cea in thread_start () from /lib/i686/libc.so.6
--------------------------------------------------------------------
clssnmClusterListener thread – This thread listens for incoming packets from other nodes by calling clscselect and dispatches them for appropriate handling. If this thread fails the node will reboot and be removed from the cluster. A stack trace of an idle clssnmClusterListener thread on a running CSS would look like this (Example on Linux):
Thread 4 (Thread 16387 (LWP 2108)):
#0 0x40e29487 in __poll (fds=0x821da08, nfds=3, timeout=-1) at ../sysdeps/unix/sysv/linux/poll.c:63
#1 0x404df4c1 in ntevpque () from /u01/app/oracle/product/10.1.0/crs/lib/libclntsh.so.10.1
#2 0x404de48b in ntevque () from /u01/app/oracle/product/10.1.0/crs/lib/libclntsh.so.10.1
#3 0x404bdfae in nsevwait () from /u01/app/oracle/product/10.1.0/crs/lib/libclntsh.so.10.1
#4 0x4004ef61 in clsc_nswait (ugblm=0x81d2fb0, cxdl=0x4166c978, cxdc=0x4166c964, poll=0) at clsc.c:4780
#5 0x40052320 in clsc_select_ext (ugblm=0x81d2fb0, hdlr=0x8068414 <clssnmeventhndlr>, evpoll=0, noreap=0) at clsc.c:5975
#6 0x4004b042 in clsc_select (ugblm=0x81d2fb0, hdlr=0x8068414 <clssnmeventhndlr>, to_ms=0, pollint_ms=0, evpoll=0, noreap=0) at clsc.c:3274
#7 0x40048743 in clscselect (ugblm=0x81d2fb0, hdlr=0x8068414 <clssnmeventhndlr>, to_ms=0, pollint_ms=0, flags=32) at clsc.c:2153
#8 0x0805df9b in clssnmClusterListener (thrd=0x81d2ea8) at clssnm.c:3192
#9 0x40d02c6f in pthread_start_thread (arg=0x4166cbe0) at manager.c:279
#10 0x40e30cea in thread_start () from /lib/i686/libc.so.6
--------------------------------------------------------------------
clssnm_skgxnmon thread – This thread monitors skgxn events (from the vendor clusterware if it exists). It is also responsible for disconnecting Node Monitor connections if it sees an skgxn reconfiguration. This allows us to break up a send to an evicted node so that other processes can continue on the new cluster. A stack trace of an idle clssnm_skgxnmon thread on a running CSS would look like this (Example on Sun with Sun Cluster installed):
----------------- lwp# 3 / thread# 5 --------------------
ffffffff7cfa5a68 poll (0, 0, c8)
ffffffff7c41e16c poll (c8, 0, 61, 0, ffffffff7be46f30, ff9) + 58
ffffffff7bd21a70 lk_wait_ast (876c550, 8, ffffffff7a209470, 8000000, 0, 801bb00) + 110
ffffffff7bd156f8 lk_sync_convert (0, 876c550, 801bb00, 61, 0, 1002c53ce) + 198
ffffffff7e902484 skgxnlcnv (ffffffff7a2099f4, 1002554a0, 876c550, 0, 2, 0) + e4
ffffffff7e903c0c skgxn_membermap (ffffffff7a2099f4, 1002c5260, 2, 1002c53be, 0, 1) + 6c
ffffffff7e905c3c skgxnpstat (1, 1002c53e0, 100251720, 1, ffffffff7a2099f4, ffffffff7ea09650) + 43c
000000010003a994 clssnm_skgxnmon (1003d9330, 4, ffffffff7a2099ec, 0, 0, 1000) + 72c
ffffffff7c41ebc8 _thread_start (100542400, 0, 0, 0, 0, 0) + 40
--------------------------------------------------------------------
clssgmclientlsnr – Creates a named pipe that establishes connections and waits for incoming client connections for the Group Manager by calling clscselect. A stack trace of an idle clssgmclientlsnr thread on a running CSS would look like this (Example on Linux):
Thread 5 (Thread 24580 (LWP 2109)):
#0 0x40e29487 in __poll (fds=0x82727d8, nfds=15, timeout=-1) at ../sysdeps/unix/sysv/linux/poll.c:63
#1 0x404df4c1 in ntevpque () from /u01/app/oracle/product/10.1.0/crs/lib/libclntsh.so.10.1
#2 0x404de48b in ntevque () from /u01/app/oracle/product/10.1.0/crs/lib/libclntsh.so.10.1
#3 0x404bdfae in nsevwait () from /u01/app/oracle/product/10.1.0/crs/lib/libclntsh.so.10.1
#4 0x4004ef61 in clsc_nswait (ugblm=0x8229208, cxdl=0x418ad6dc, cxdc=0x418ad6c8, poll=0) at clsc.c:4780
#5 0x40052320 in clsc_select_ext (ugblm=0x8229208, hdlr=0x8084904 <clssgmclienteventhndlr>, evpoll=0, noreap=0) at clsc.c:5975
#6 0x4004b042 in clsc_select (ugblm=0x8229208, hdlr=0x8084904 <clssgmclienteventhndlr>, to_ms=0, pollint_ms=0, evpoll=0, noreap=0) at clsc.c:3274
#7 0x40048743 in clscselect (ugblm=0x8229208, hdlr=0x8084904 <clssgmclienteventhndlr>, to_ms=0, pollint_ms=0, flags=32) at clsc.c:2153
#8 0x0807b790 in clssgmclientlsnr (thrd=0x8229120) at clssgm1.c:498
#9 0x40d02c6f in pthread_start_thread (arg=0x418adbe0) at manager.c:279
#10 0x40e30cea in thread_start () from /lib/i686/libc.so.6
--------------------------------------------------------------------
clsc_authent_thrd – Thread to perform async authentication for clsc. A stack trace of an idle clsc_authent_thrd on a running CSS would look like this (Example on Linux):
Thread 6 (Thread 32773 (LWP 2110)):
#0 0x40d71be5 in __sigsuspend (set=0x41aee870) at ../sysdeps/unix/sysv/linux/sigsuspend.c:45
#1 0x40d052b9 in __pthread_wait_for_restart_signal (self=0x41aeebe0) at pthread.c:1027
#2 0x40d01bdc in pthread_cond_wait (cond=0x827c0b0, mutex=0x827c048) at restart.h:34
#3 0x4007b7f3 in sltspcwait () from /u01/app/oracle/product/10.1.0/crs/lib/libhasgen10.so
#4 0x40056c21 in clsc_cvwait (scx=0x8113da8, cv=0x827c038, mx=0x827c02c) at clsc.c:7793
#5 0x40055b35 in clsc_authent_thrd (thrd=0x827c008) at clsc.c:7472
#6 0x40d02c6f in pthread_start_thread (arg=0x41aeebe0) at manager.c:279
#7 0x40e30cea in thread_start () from /lib/i686/libc.so.6
--------------------------------------------------------------------
clssgmPeerListener thread – This thread listens for messages from remote GM threads. During a normal state, only the GM master will send messages via this channel. During a reconfiguration, this will be used by all nodes. A stack trace of an idle clssgmPeerListener thread on a running CSS would look like this (Example on Linux):
Thread 7 (Thread 40966 (LWP 2111)):
#0 0x40e29487 in __poll (fds=0x82cc740, nfds=3, timeout=-1) at ../sysdeps/unix/sysv/linux/poll.c:63
#1 0x404df4c1 in ntevpque () from /u01/app/oracle/product/10.1.0/crs/lib/libclntsh.so.10.1
#2 0x404de48b in ntevque () from /u01/app/oracle/product/10.1.0/crs/lib/libclntsh.so.10.1
#3 0x404bdfae in nsevwait () from /u01/app/oracle/product/10.1.0/crs/lib/libclntsh.so.10.1
#4 0x4004ef61 in clsc_nswait (ugblm=0x8283170, cxdl=0x41d3069c, cxdc=0x41d30688, poll=0) at clsc.c:4780
#5 0x40052320 in clsc_select_ext (ugblm=0x8283170, hdlr=0x8079c78 <clssgmPeerEventHndlr>, evpoll=0, noreap=0) at clsc.c:5975
#6 0x4004b042 in clsc_select (ugblm=0x8283170, hdlr=0x8079c78 <clssgmPeerEventHndlr>, to_ms=0, pollint_ms=0, evpoll=0, noreap=0) at clsc.c:3274
#7 0x40048743 in clscselect (ugblm=0x8283170, hdlr=0x8079c78 <clssgmPeerEventHndlr>, to_ms=0, pollint_ms=0, flags=32) at clsc.c:2153
#8 0x080741bf in clssgmPeerListener (thrd=0x8283078) at clssgm.c:3150
#9 0x40d02c6f in pthread_start_thread (arg=0x41d30be0) at manager.c:279
#10 0x40e30cea in thread_start () from /lib/i686/libc.so.6
--------------------------------------------------------------------
clssnmPollingThread – Periodically wakes up and scans to see who is active and has been checking in regularly. This thread drives all reconfigurations. A stack trace of an idle clssnmPollingThread on a running CSS would look like this (Example on Linux):
Thread 8 (Thread 49159 (LWP 2112)):
#0 0x40dfe681 in __libc_nanosleep () from /lib/i686/libc.so.6
#1 0x40e2b3fb in usleep (useconds=1000000) at ../sysdeps/unix/sysv/linux/usleep.c:30
#2 0x400784be in sltrusleep () from /u01/app/oracle/product/10.1.0/crs/lib/libhasgen10.so
#3 0x08085b6a in clssgcsleep (millisec=1000) at clssgc.c:172
#4 0x0805f512 in clssnmPTWait (thrd=0x82d6e98, pNMInfo=0x8128e98) at clssnm.c:3650
#5 0x0805f7f0 in clssnmPollingThread (thrd=0x82d6e98) at clssnm.c:3711
#6 0x40d02c6f in pthread_start_thread (arg=0x420ffbe0) at manager.c:279
#7 0x40e30cea in thread_start () from /lib/i686/libc.so.6
--------------------------------------------------------------------
clssnmDiskPingThread – This thread writes disk heartbeats so that other members know we are still alive. A stack trace of an idle clssnmDiskPingThread on a running CSS would look like this (Example on Linux):
Thread 9 (Thread 57352 (LWP 2113)):
#0 0x40dfe681 in __libc_nanosleep () from /lib/i686/libc.so.6
#1 0x40e2b3fb in usleep (useconds=1000000) at ../sysdeps/unix/sysv/linux/usleep.c:30
#2 0x400784be in sltrusleep () from /u01/app/oracle/product/10.1.0/crs/lib/libhasgen10.so
#3 0x08085b6a in clssgcsleep (millisec=1000) at clssgc.c:172
#4 0x08060aae in clssnmDiskPingThread (thrd=0x82d6fa0) at clssnm.c:4169
#5 0x40d02c6f in pthread_start_thread (arg=0x422ffbe0) at manager.c:279
#6 0x40e30cea in thread_start () from /lib/i686/libc.so.6
--------------------------------------------------------------------
clssnmSendingThread – This thread periodically wakes up and sends appropriate packets (based on the join state) to other nodes so that other members know we are still alive. A stack trace of an idle clssnmSendingThread on a running CSS would look like this (Example on Linux):
Thread 10 (Thread 65545 (LWP 2114)):
#0 0x40dfe681 in __libc_nanosleep () from /lib/i686/libc.so.6
#1 0x40e2b3fb in usleep (useconds=1000000) at ../sysdeps/unix/sysv/linux/usleep.c:30
#2 0x400784be in sltrusleep () from /u01/app/oracle/product/10.1.0/crs/lib/libhasgen10.so
#3 0x08085b6a in clssgcsleep (millisec=1000) at clssgc.c:172
#4 0x0805f1a5 in clssnmSendingThread (thrd=0x82d70a8) at clssnm.c:3564
#5 0x40d02c6f in pthread_start_thread (arg=0x424ffbe0) at manager.c:279
#6 0x40e30cea in thread_start () from /lib/i686/libc.so.6
--------------------------------------------------------------------
clssnmDiskPingMonitorThread – Monitors the clssnmDiskPingThread to validate that it is correctly reading the kill block. If the clssnmDiskPingThread isn’t reading the kill block, the cluster member is evicted / rebooted. A stack trace of an idle clssnmDiskPingMonitorThread on a running CSS would look like this (Example on Linux):
Thread 11 (Thread 73738 (LWP 2115)):
#0 0x40dfe681 in __libc_nanosleep () from /lib/i686/libc.so.6
#1 0x40e2b3fb in usleep (useconds=44040000) at ../sysdeps/unix/sysv/linux/usleep.c:30
#2 0x400784be in sltrusleep () from /u01/app/oracle/product/10.1.0/crs/lib/libhasgen10.so
#3 0x08085b6a in clssgcsleep (millisec=44040) at clssgc.c:172
#4 0x0805f4c4 in clssnmDiskPingMonitorThread (thrd=0x82d71b0) at clssnm.c:3633
#5 0x40d02c6f in pthread_start_thread (arg=0x426ffbe0) at manager.c:279
#6 0x40e30cea in thread_start () from /lib/i686/libc.so.6
--------------------------------------------------------------------
CSS Reconfiguration
===================
NM Reconfiguration: The following shows an NM reconfiguration in the CSS log:
Joining Node:
2005-04-20 14:16:22.405 [16387] >TRACE: clssnmHandleSync: Acknowledging sync: src[2] seq[5] sync[2]
2005-04-20 14:16:26.417 [16387] >USER: clssnmHandleUpdate: SYNC(2) from node(2) completed
2005-04-20 14:16:26.417 [16387] >USER: clssnmHandleUpdate: NODE(1) IS ACTIVE MEMBER OF CLUSTER
2005-04-20 14:16:26.417 [16387] >USER: clssnmHandleUpdate: NODE(2) IS ACTIVE MEMBER OF CLUSTER
2005-04-20 14:16:26.423 [16387] >TRACE: clssnmvReadFatal: fatal mode assumed from no op
2005-04-20 14:16:26.515 [8192] >USER: NMEVENT_SUSPEND [00][00][00][00]
2005-04-20 14:16:27.525 [81931] >USER: NMEVENT_RECONFIG [00][00][00][06]
Master Node:
2005-04-20 14:16:21.367 [49159] >TRACE: clssnmDoSyncUpdate: Initiating sync 2
2005-04-20 14:16:22.367 [16387] >TRACE: clssnmHandleSync: Acknowledging sync: src[2] seq[5] sync[2]
2005-04-20 14:16:22.497 [8192] >USER: NMEVENT_SUSPEND [00][00][00][04]
2005-04-20 14:16:26.419 [16387] >USER: clssnmHandleUpdate: SYNC(2) from node(2) completed
2005-04-20 14:16:26.419 [16387] >USER: clssnmHandleUpdate: NODE(1) IS ACTIVE MEMBER OF CLUSTER
2005-04-20 14:16:26.419 [16387] >USER: clssnmHandleUpdate: NODE(2) IS ACTIVE MEMBER OF CLUSTER
2005-04-20 14:16:26.587 [90123] >USER: NMEVENT_RECONFIG [00][00][00][06]
These entries show that node 1 is attempting to join the cluster. When NM starts up it attempts to connect to all configured nodes (async connect starting in 10.1.0.3). If any alive nodes are found we send a joining message to these nodes, otherwise just start the reconfiguration. If we find other active nodes in the cluster, we wait for the reconfiguration to start (joining node never initiates the reconfiguration). In the above entries, node 1 is joining the cluster and node 2 is initiating the reconfiguration. Node 2 then sends a sync message to node 1. In the “clssnmHandleSync” line the nodes have acknowledged the sync and we declare that nodes 1 and 2 are active members of the cluster. The last line passes the NM reconfiguration event up to the GM.
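For longer logs, the membership result of a reconfiguration can be extracted with a short script. This is a sketch under the assumption that every entry of interest follows the "timestamp [thread] >LEVEL: message" layout seen in the samples above; the function name active_members is invented for the example.

```python
import re

# Layout inferred from the sample entries: timestamp, [thread id], >LEVEL:, message.
LOG_LINE = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d{3}) "
    r"\[(?P<thread>\d+)\] >(?P<level>\w+): (?P<msg>.*)$"
)
ACTIVE = re.compile(r"NODE\((\d+)\) IS ACTIVE MEMBER OF CLUSTER")


def active_members(log_text):
    """Collect the node numbers reported as active by clssnmHandleUpdate."""
    members = set()
    for raw in log_text.splitlines():
        m = LOG_LINE.match(raw.strip())
        if not m:
            continue  # e.g. banner lines that carry no thread id
        hit = ACTIVE.search(m.group("msg"))
        if hit:
            members.add(int(hit.group(1)))
    return sorted(members)


sample = """\
2005-04-20 14:16:26.417 [16387] >USER: clssnmHandleUpdate: SYNC(2) from node(2) completed
2005-04-20 14:16:26.417 [16387] >USER: clssnmHandleUpdate: NODE(1) IS ACTIVE MEMBER OF CLUSTER
2005-04-20 14:16:26.417 [16387] >USER: clssnmHandleUpdate: NODE(2) IS ACTIVE MEMBER OF CLUSTER
2005-04-20 14:16:27.525 [81931] >USER: NMEVENT_RECONFIG [00][00][00][06]
"""
print(active_members(sample))
```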
The reconfiguration goes through more phases than are visible in the above trace file at trace level 1 (the default). The detailed steps are:
NM Reconfiguration:
- Initial Phase
- Vote Phase
- Split Check Phase
- Evict Phase
- Update Phase
In the initial phase, the lowest numbered node starts the reconfiguration. Nodes in member or joining states participate in the reconfiguration. The RM (Reconfig Manager) sends a sync message to all participating nodes. Participating nodes respond with a sync acknowledgement. After this the vote phase begins and the master sends a vote message to all participating nodes. Participating nodes respond with a vote info message containing their node identifier and GM peer-to-peer listening endpoint. In the split-check phase, the RM uses the voting disk to verify there is no split-brain. It looks for nodes that are heartbeating to disk but are not connected via the network. If it finds any, it determines which nodes are talking to which, and the largest subcluster survives. For example, if we have a 5-node cluster and all of the nodes are heartbeating to the voting disk but only a group of 3 can communicate via the network and a group of 2 can communicate via the network, we have 2 subclusters. The largest subcluster (3) would survive while the other subcluster (2) would not. After this the evict phase evicts nodes previously in the cluster but not considered members in this incarnation. In this case we send a message to the evicted nodes (if possible) and write an eviction notice to a ‘kill’ block in the voting file. We then wait up to <misscount> seconds for the node to indicate it got the eviction notice. The wait is terminated by a message or a status on the voting file indicating that the node got the eviction notice. In the update phase the master sends an update message containing the definitive cluster membership and node information for all participating nodes. The participating nodes send update acknowledgements. All members queue the reconfiguration event to their GM.
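The split-check decision in the 5-node example can be sketched as follows. The function name and the node numbering are made up for illustration. The text only says that the largest subcluster survives, so the tie-break used here (the group holding the lowest node number wins) is an assumption and is not taken from the CSS source.

```python
def surviving_subcluster(subclusters):
    """Pick the surviving subcluster: the largest group wins.

    ASSUMPTION: on a tie in size, the group containing the lowest node
    number wins. The original text does not describe tie-breaking.
    """
    return max(subclusters, key=lambda group: (len(group), -min(group)))


# The 5-node example: nodes {1, 2, 3} and {4, 5} can no longer reach each
# other over the network, but all five still heartbeat to the voting disk.
groups = [{4, 5}, {1, 2, 3}]
survivors = surviving_subcluster(groups)
evicted = sorted(set().union(*groups) - survivors)
print(sorted(survivors), evicted)
```

The nodes listed in evicted are the ones that would receive the eviction message and the kill-block notice in the evict phase.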
GM Reconfig:
Once the NM reconfiguration is finished, GM will be notified and we will see something like the following in the CSS log:
2005-04-20 14:16:27.525 [81931] >USER: NMEVENT_RECONFIG [00][00][00][06]
2005-04-20 14:16:27.525 [81931] >TRACE: clssgmEstablishConnections: 2 nodes in cluster incarn 2
2005-04-20 14:16:27.526 [40966] >TRACE: clssgmInitialRecv: (0x82d7800) accepted a new connection from node 2 born at 1 active (2, 2), vers (10,2,1,2)
2005-04-20 14:16:27.526 [40966] >TRACE: clssgmInitialRecv: conns done (2/2)
CLSS-3000: reconfiguration successful, incarnation 2 with 2 nodes
CLSS-3001: local node number 1, master node number 2
When a GM reconfiguration occurs, most of the activity is local and independent. After GM gets the NMEVENT_RECONFIG message it will clean up and create connections, remove group members from departed nodes, and determine the group master node. This may not be the same node as the RM from NM. When selecting the group master node we look for the oldest node, then look for a node that is not the RM. In the above example node 2 is the group master node. Once the group master node has been selected, group member data is verified and the master sends this data to all nodes. The completion is signalled by the DBDone message.
CSS Diagnostics
===============
CSS/CLSC Tracing:
Use crsctl to set the trace level:
10.2.0.2 (Bundle Patch 2) and above:
<CRS_HOME>/bin/crsctl debug log css CSSD:2 (default is 1)
(No subsequent reboot needed if you do it this way)
Prior versions:
<CRS_HOME>/bin/crsctl set css trace 2 (default is 1)
The CRS stack must be up when this command is run and a reboot is required for this to take effect.
Levels:
1=Production default
2=Development default
3=Verbose (performance degrading)
4=Super verbose (performance degrading)
Most problems can be solved with level 2. Some require level 3. Few require level 4. If you use level 3 or 4, trace information may only be kept for a few hours (or even minutes) because the trace files can fill up and information can be overwritten. If you need to keep data for a longer period of time, create a cron job to back up the css logs to unique file names (and compress them).
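A backup job of the kind suggested above could look like the following sketch. The function name and the backup directory are placeholders, and the ocssd*.log pattern is inferred from the ocssd#.log file name mentioned earlier; adjust both to the actual environment.

```python
import gzip
import shutil
import time
from pathlib import Path


def backup_css_logs(log_dir, backup_dir, pattern="ocssd*.log"):
    """Copy each matching log to a timestamped, gzip-compressed file."""
    backup_dir = Path(backup_dir)
    backup_dir.mkdir(parents=True, exist_ok=True)
    # A per-run timestamp keeps every backup under a unique file name.
    stamp = time.strftime("%Y%m%d_%H%M%S")
    written = []
    for src in sorted(Path(log_dir).glob(pattern)):
        dest = backup_dir / f"{src.stem}_{stamp}.log.gz"
        with open(src, "rb") as fin, gzip.open(dest, "wb") as fout:
            shutil.copyfileobj(fin, fout)
        written.append(dest)
    return written
```

Called from cron every few minutes with the CSS log directory and a backup directory that has enough free space, this keeps compressed copies of the log even after the live trace file has wrapped.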
In 10.2 you can turn on additional CLSC and/or CLSC NS tracing by setting the following in <CRS_HOME>/bin/ocssd (and restarting CRS):
CLSC_TRACE_LVL=5
export CLSC_TRACE_LVL
CLSC_NSTRACE_LVL=12
export CLSC_NSTRACE_LVL
The CLSC trace level can be anything between 1 and 8.
The CLSC NS trace level can be anything between 1 and 16.
In 10.1 you can also set the NS / sqlnet trace level with:
$ORA_CRS_HOME/bin/crsctl set css nstrace 12
The CRS stack must be up when this command is run and a reboot is required for this to take effect.
The level can be set to anything between 1 and 16.
To trace the database interaction with CSS, you can set event 29718 to level 21-24. This will add additional information to the RDBMS trace file(s).
Check $ORA_CRS_HOME/css/init for core files. If there are core files, check all threads of each core file and get a stack trace for each thread. Note 118252.1 has information on gathering multiple threads.
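Once the per-thread stack traces have been gathered, a quick way to see which CSS thread is which is to find the outermost cls* function in each trace. The sketch below assumes gdb-style output in the same format as the Linux examples earlier in this note; the function name css_entry_points is made up for this example.

```python
import re

THREAD_HDR = re.compile(r"^Thread (\d+) \(Thread \d+ \(LWP (\d+)\)\):")
FRAME = re.compile(r"^#\d+\s+(?:0x[0-9a-f]+ in )?(\w+) \(")


def css_entry_points(gdb_output):
    """Map gdb thread number -> outermost function whose name starts with 'cls'."""
    result, current = {}, None
    for line in gdb_output.splitlines():
        line = line.strip()
        hdr = THREAD_HDR.match(line)
        if hdr:
            current = int(hdr.group(1))
            continue
        frame = FRAME.match(line)
        if frame and current is not None and frame.group(1).startswith("cls"):
            # Frames are listed innermost first, so the last match is outermost.
            result[current] = frame.group(1)
    return result


# Abridged traces taken from the Linux examples in this note.
sample = """\
Thread 8 (Thread 49159 (LWP 2112)):
#0 0x40dfe681 in __libc_nanosleep () from /lib/i686/libc.so.6
#3 0x08085b6a in clssgcsleep (millisec=1000) at clssgc.c:172
#4 0x0805f512 in clssnmPTWait (thrd=0x82d6e98, pNMInfo=0x8128e98) at clssnm.c:3650
#5 0x0805f7f0 in clssnmPollingThread (thrd=0x82d6e98) at clssnm.c:3711
#6 0x40d02c6f in pthread_start_thread (arg=0x420ffbe0) at manager.c:279
Thread 9 (Thread 57352 (LWP 2113)):
#3 0x08085b6a in clssgcsleep (millisec=1000) at clssgc.c:172
#4 0x08060aae in clssnmDiskPingThread (thrd=0x82d6fa0) at clssnm.c:4169
"""
print(css_entry_points(sample))
```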
For CSS reboots see Note 265769.1 for additional files to gather.
References
==========
- CSS Presentation by John Leys
- Note 265769.1 – 10g RAC Troubleshooting CRS Reboots
- Note 457772.1 – Internal Only: Advanced Troubleshooting of CSS Heartbeat Failures (Network)
- CSS and CLSC Source Code
The RAC process hierarchy is attached below:
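© 2010, www.oracledatabase12g.com. All rights reserved. This article may be reposted, but only with a link back to the source address; otherwise legal action may be taken.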
Hello, I am not sure how you are posting an Oracle Internal Only (confidential) document on the internet. You may want to check to make sure that you are not violating the Oracle Support contract agreement.
Amazing.
Thank you very much!