Applies to:
Oracle Server – Enterprise Edition – Version: 9.2.0.1 to 11.1.0.7 – Release: 9.2 to 11.1
Oracle Server – Standard Edition – Version: 9.2.0.1 to 11.1.0.7 – Release: 9.2 to 11.1
Information in this document applies to any platform.
Oracle Real Application Clusters
Purpose
This guide documents common RAC issues and identifies the trace files that must be reviewed to understand the cause of each problem.
Last Review Date
December 1, 2008
Instructions for the Reader
Troubleshooting Details
Common issues when using Oracle Real Application Clusters are:
- GES potential blocker
- IPC send timed out
- ORA-29740 evicted by member %s, group incarnation %s
- ORA-481 LMON process terminated with error
- ORA-480 LCK* process terminated with error
- Performance issue
- Instance Startup issues
In each of these cases, several trace files are needed to understand the cause of the problem, and Oracle Support may request them. If trace files are not periodically cleaned out of the background_dump_dest directory, it may contain old lms trace files from a previous startup; in that case, take care to provide the lms trace files from the time of the issue. So many files are needed because each process has a specific task, and its trace may be required to find out why the holder is holding a resource for such a long time. In addition, the background processes may take systemstate dumps, which may also be required to understand the problem. Common trace files generated by processes specific to RAC are:
- alert.log from each node
- lms[0-9] trace file from each node
- lck trace file from each node
- lmon trace file from each node
- diag trace file from each node
- Any trace file documented in the alert.log at the time of the issue.
In almost all of the cases below, all of these files are required from each node to understand the problem. It is highly advisable to create one zip file per node containing that node's relevant trace files before uploading.
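The per-node packaging step described above can be sketched in Python. This is a minimal illustration, not an Oracle-supplied tool: the file-name patterns, the function names select_trace_files and zip_node_traces, and the archive name are all assumptions made for the example, and the patterns should be adjusted to match the actual instance names. Files last modified before the given cutoff are skipped so that old traces from a previous startup are not uploaded by mistake.

```python
import fnmatch
import zipfile
from datetime import datetime
from pathlib import Path

# Assumed file-name patterns (lower case); real names depend on the
# instance name and platform, so adjust them before use.
RAC_TRACE_PATTERNS = [
    "alert*.log",
    "*_lms[0-9]_*.trc",
    "*_lck*_*.trc",
    "*_lmon_*.trc",
    "*_diag_*.trc",
]


def select_trace_files(dump_dir, since, patterns=RAC_TRACE_PATTERNS):
    """Return RAC-related files in dump_dir last modified at or after `since`."""
    selected = []
    for path in sorted(Path(dump_dir).iterdir()):
        if not path.is_file():
            continue
        name = path.name.lower()
        if not any(fnmatch.fnmatchcase(name, p) for p in patterns):
            continue
        modified = datetime.fromtimestamp(path.stat().st_mtime)
        if modified >= since:
            selected.append(path)
    return selected


def zip_node_traces(node_name, dump_dir, since, out_dir="."):
    """Write one zip archive holding the selected trace files for one node."""
    files = select_trace_files(dump_dir, since)
    archive = Path(out_dir) / f"{node_name}_rac_traces.zip"
    with zipfile.ZipFile(archive, "w", zipfile.ZIP_DEFLATED) as zf:
        for path in files:
            zf.write(path, arcname=path.name)
    return archive
```

Run it once on each node, passing a cutoff shortly before the issue began, and upload the resulting archives separately. Any additional trace file named in the alert.log at the time of the issue still has to be added by hand.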
GES potential blocker
An instance reports this error when it is unable to get a resource for a period of time. An example is shown below:
enqueue info in file /opt/oracle/trace/10gR2/bdump/v10gR21_lmd0_6665.trc and DIAG trace file
In the above example, the instance is complaining about PID 27388, which is blocking access to resource LF. To find out why this process is holding the lock, all the trace files mentioned above are required.
IPC send timed out
An instance reports IPC send timed out when the receiver does not acknowledge messages from the sender. This typically means that the private network used for communication is broken or dropping packets. These messages can also appear when the receiver does not get enough CPU time to acknowledge receipt.
Fri Sep 23 22:02:43 2008
Errors in file /opt/bdump/10gR2/bdump/v10gR21_lms0_16582.trc:
In the above example, the lms process on Node 1 is reporting a send timeout. The details of the sender, the receiver, and the message itself are captured in the trace files. All the files mentioned earlier are required to understand the root cause.
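Because the alert.log names the trace file behind each error, one way to build the list of files to collect is to scan it for "Errors in file" lines. The sketch below is a hedged illustration that assumes the alert.log layout matches the excerpt above (a timestamp line followed by an "Errors in file <path>:" line); the function name referenced_trace_files and both regular expressions are assumptions made for this example.

```python
import re

# Matches timestamp lines such as "Fri Sep 23 22:02:43 2008".
TIMESTAMP_RE = re.compile(
    r"^[A-Z][a-z]{2} [A-Z][a-z]{2} +\d{1,2} \d{2}:\d{2}:\d{2} \d{4}$"
)
# Matches "Errors in file <path>:" and captures the path without the colon.
ERRORS_IN_FILE_RE = re.compile(r"^Errors in file (\S+?):?$")


def referenced_trace_files(alert_log_lines):
    """Return (timestamp, trace_path) pairs for each 'Errors in file' line.

    The timestamp is the most recent timestamp line seen before the error
    line, or None if no timestamp has been seen yet.
    """
    results = []
    last_timestamp = None
    for raw in alert_log_lines:
        line = raw.strip()
        if TIMESTAMP_RE.match(line):
            last_timestamp = line
            continue
        match = ERRORS_IN_FILE_RE.match(line)
        if match:
            results.append((last_timestamp, match.group(1)))
    return results
```

Passing an open alert.log file object (or a list of its lines) returns the trace paths along with the time each was reported, which helps pick out the files written at the time of the issue.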
ORA-29740 evicted by member %s, group incarnation %s
An instance reports ORA-29740 when the instances cannot communicate with each other, although there may be other reasons for this error. It is more common in 9i than in 10g because, starting with 10g, the clusterware may evict the node first when there is a communication error, assuming the clusterware shares the same private network. All the files mentioned earlier are required to understand the cause of the eviction.
ORA-29740: evicted by member 0, group incarnation 18
Mon Dec 8 01:52:25 2007
LMON: terminating instance due to error 29740
ORA-481 LMON process terminated with error
An instance reports ORA-481 and crashes when the LMON process dies. One reason an instance can report this error is that the LMON process has run into an ORA-600 error.
ORA-481: LMON process terminated with error
Thu Sep 25 03:46:56 2008
LMON: terminating instance due to error 481
All the traces mentioned above are required to find out why lmon got the error.
ORA-480 LCK* process terminated with error
This error is essentially the same as the one above, except that ORA-480 is reported when the lck process terminates. All the traces mentioned above are required to find out why the lck process terminated.
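As a first triage step, the alert.log from each node can be scanned for the three instance-termination errors discussed above. The following is a minimal sketch under the assumption that the messages appear in the form shown in the excerpts (for example "ORA-29740: evicted by member 0, group incarnation 18"); the function name find_rac_errors and the dictionary layout are hypothetical. It only locates the errors and does not replace review of the trace files.

```python
import re

# Error numbers and descriptions taken from the sections above.
RAC_ERRORS = {
    29740: "evicted by member, group incarnation",
    481: "LMON process terminated with error",
    480: "LCK* process terminated with error",
}

# Accepts both "ORA-481" and zero-padded forms such as "ORA-00481".
ORA_RE = re.compile(r"ORA-0*(\d+)")
EVICTION_RE = re.compile(
    r"ORA-0*29740: evicted by member (\d+), group incarnation (\d+)"
)


def find_rac_errors(lines):
    """Return one dict per line that reports ORA-29740, ORA-481 or ORA-480."""
    findings = []
    for number, raw in enumerate(lines, start=1):
        line = raw.strip()
        match = ORA_RE.search(line)
        if not match:
            continue
        code = int(match.group(1))
        if code not in RAC_ERRORS:
            continue
        finding = {"line": number, "code": code, "meaning": RAC_ERRORS[code]}
        eviction = EVICTION_RE.search(line)
        if eviction:
            # Record which member evicted this instance and the incarnation.
            finding["member"] = int(eviction.group(1))
            finding["incarnation"] = int(eviction.group(2))
        findings.append(finding)
    return findings
```

The reported line numbers indicate where in each alert.log to start reading, and the surrounding timestamps show which trace files fall within the time of the issue.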
General Hang and/or Performance issue
Performance Issues
Performance issues are cases where the database is not performing optimally. The customer should try to explain why they believe performance is bad. Common examples are:
a.) Reports that completed in X time last week took noticeably longer yesterday.
b.) Insert/update/deletes are slow after moving from Single instance to RAC.
Some performance issues are side effects of applying an OS patch or of an increase in workload. To understand the cause of a performance issue, it is crucial to collect data such as Statspack/AWR reports as well as OS statistics. It is a good idea to collect both OS and DB statistics before making any change, such as applying a patch or adding a new node.
To understand the cause of the performance issue, the following information is crucial:
- Clear description of the performance problem, along with an explanation of why the customer feels performance is bad.
- AWR reports (if the customer has the appropriate license) or Statspack reports from each node during the performance issue. AWR reports with a 60-minute interval are a good starting point.
- AWR report from each node from a time when performance was acceptable. AWR reports with a 60-minute interval are a good starting point.
- OS Watcher or IPD/OS data from each node for the duration of the hang.
- Alert.logs from each instance
- Any trace files printed in the alert.log during the performance hang.
Hang Issues
A database or instance hang occurs when a process waits forever. Eventually other processes queue behind the hung process, and soon everything is hung. An Oracle RAC database has timeouts associated with crucial background processes that cause it to dump diagnostic information automatically when a process is not responding. Note that in some cases a hang is not really a hang but a severe performance issue, where the database is so slow that customers may incorrectly perceive it as a hang.
To understand the cause of the hang issue, the following information is crucial:
- Clear description of the hang and how it was detected.
- AWR reports (if the customer has the appropriate license) or Statspack reports from each node for the duration of the hang. AWR reports with a 60-minute interval are a good starting point.
- Clusterware logs as per Note 289690.1
- OS Watcher or IPD/OS data from each node
- Alert.logs from each instance
- Any trace files printed in the alert.log during the hang.
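© 2009, www.oracledatabase12g.com. All rights reserved. This article may be reproduced, but the source address must be credited with a link; otherwise legal action will be pursued.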