Data gathering for troubleshooting Oracle Real Application Cluster issues

Author: Maclean Liu, posted on September 16th, 2009, English Version

Applies to:

Oracle Server – Enterprise Edition – Version: 9.2.0.1 to 11.1.0.7 [Release: 9.2 to 11.1]
Oracle Server – Standard Edition – Version: 9.2.0.1 to 11.1.0.7 [Release: 9.2 to 11.1]
Information in this document applies to any platform.
Oracle Real Application Clusters

Purpose

This guide documents common RAC issues and identifies the trace files that must be reviewed to understand the cause of each problem.

Last Review Date

December 1, 2008

Instructions for the Reader

A Troubleshooting Guide is provided to assist in debugging a specific issue. When possible, diagnostic tools are included in the document to assist in troubleshooting.

Troubleshooting Details

Common issues when using Oracle Real Application Clusters are:

  • GES potential blocker
  • IPC send timed out
  • ORA-29740 evicted by member %s, group incarnation %s
  • ORA-481 LMON process terminated with error
  • ORA-480 LCK* process terminated with error
  • Performance issue
  • Instance Startup issues

In each of these cases, multiple trace files are required to understand the cause of the problem, and Oracle Support may request these files to explain it. If trace files are not periodically cleaned out of the background_dump_dest directory, that directory may still contain old lms trace files from a previous startup. In that case, take care to provide the lms trace files from the time of the issue for analysis. So many files are needed because each process has a specific task, and its trace file may be required to find out why the holder is holding the resource for such a long time. Additionally, the background processes may take systemstate dumps, which may be required to understand the problem. Common trace files generated by processes specific to RAC are:

  1. alert.log from each node
  2. lms[0-9] trace files from each node
  3. lck trace file from each node
  4. lmon trace file from each node
  5. diag trace file from each node
  6. Any trace file documented in the alert.log at the time of the issue
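The last item in the list can be approximated with a small script. The following is a minimal Python sketch, not an Oracle-provided tool: it assumes the alert.log uses timestamp lines and "Errors in file" lines in the same shape as the excerpts in this article, that month names are in English, and that each "Errors in file" line belongs to the closest preceding timestamp. The function name traces_named_in_alert_log is hypothetical.

```python
import re
from datetime import datetime

# Timestamp lines such as "Fri Sep 23 22:02:43 2008". The weekday is skipped
# and the remainder is parsed (assumes English month abbreviations).
TS_RE = re.compile(
    r"^[A-Z][a-z]{2} ([A-Z][a-z]{2}\s+\d{1,2} \d{2}:\d{2}:\d{2} \d{4})\s*$"
)
# Lines such as "Errors in file /path/to/sid_lms0_16582.trc:"
ERR_RE = re.compile(r"^Errors in file (\S+?):?\s*$")


def traces_named_in_alert_log(lines, start, end):
    """Return trace file paths from 'Errors in file' lines whose closest
    preceding timestamp falls within [start, end]. Lines that appear before
    the first timestamp are ignored."""
    latest = None
    found = []
    for line in lines:
        ts = TS_RE.match(line)
        if ts:
            latest = datetime.strptime(ts.group(1), "%b %d %H:%M:%S %Y")
            continue
        err = ERR_RE.match(line)
        if err and latest is not None and start <= latest <= end:
            path = err.group(1)
            if path not in found:  # keep first occurrence only
                found.append(path)
    return found


# Usage with the IPC send timeout excerpt shown later in this article.
sample = [
    "IPC Send timeout detected. Receiver ospid 16582",
    "Fri Sep 23 22:02:43 2008",
    "Errors in file /opt/bdump/10gR2/bdump/v10gR21_lms0_16582.trc:",
]
print(traces_named_in_alert_log(
    sample, datetime(2008, 9, 23, 22, 0), datetime(2008, 9, 23, 23, 0)))
```

In practice the lines would come from opening the alert.log of each node, and the resulting paths would be added to the files collected for that node.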

In almost all the cases below, all of these files are required from each node to understand the problem. It is highly advisable to create one zip file of the relevant trace files for each node before uploading (one zip file per node).
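As an illustration of the one-zip-per-node advice, here is a minimal Python sketch that is run on each node against its own dump directory. It is only one possible approach: it assumes that the file modification time is a good enough filter to leave out stale trace files from earlier startups, and the function name zip_node_traces, the hours_before parameter, and the "<node>_traces.zip" naming are all hypothetical.

```python
import os
import zipfile
from datetime import timedelta


def zip_node_traces(dump_dir, node_name, incident_time, hours_before=2, out_dir="."):
    """Zip the files in dump_dir that were last modified at or after
    incident_time - hours_before, so that stale traces from earlier startups
    are left out. incident_time is a datetime. Returns (zip_path, names)."""
    cutoff = (incident_time - timedelta(hours=hours_before)).timestamp()
    zip_path = os.path.join(out_dir, f"{node_name}_traces.zip")

    # Select the files before creating the archive, and never include the
    # archive itself in case out_dir is the same directory as dump_dir.
    names = []
    for name in sorted(os.listdir(dump_dir)):
        full = os.path.join(dump_dir, name)
        if os.path.abspath(full) == os.path.abspath(zip_path):
            continue
        if os.path.isfile(full) and os.path.getmtime(full) >= cutoff:
            names.append(name)

    with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as zf:
        for name in names:
            zf.write(os.path.join(dump_dir, name), arcname=name)
    return zip_path, names
```

A call might look like zip_node_traces("/path/to/bdump", "node1", incident_time), where the directory is whatever background_dump_dest points to on that node. The returned list of names is worth checking by hand against the list of required files, because a modification-time filter cannot tell whether a required trace is missing.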

GES potential blocker

An instance reports this error when it is unable to get a resource for a period of time. An example is shown below:

GES: Potential blocker (pid=27388) on resource LF-DC7B39ED-ABA41B1E;
enqueue info in file /opt/oracle/trace/10gR2/bdump/v10gR21_lmd0_6665.trc and DIAG  trace file

In the above example, the instance is complaining about PID 27388, which is blocking access to an LF resource. To find out why this process is holding the lock, all the trace files mentioned above are required.
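When many such messages have to be triaged, the blocker PID, the resource name, and the referenced trace file can be pulled out of the message text. The following Python sketch assumes the message always has the shape of the example above. The regular expression and the function name parse_ges_blocker are illustrative, and treating the part of the resource name before the first hyphen as the resource type is an assumption based on that single example.

```python
import re

# Pattern modelled on the single example message in this article.
GES_RE = re.compile(
    r"GES: Potential blocker \(pid=(?P<pid>\d+)\) on resource (?P<resource>[^;\s]+);"
    r"\s*enqueue info in file (?P<trace_file>\S+)"
)


def parse_ges_blocker(text):
    """Return the blocker PID, resource, and trace file named in a
    'GES: Potential blocker' message, or None if no message is found."""
    m = GES_RE.search(text)
    if m is None:
        return None
    resource = m.group("resource")
    return {
        "pid": int(m.group("pid")),
        "resource": resource,
        # Assumption: the prefix before the first hyphen is the resource type.
        "resource_type": resource.split("-")[0],
        "trace_file": m.group("trace_file"),
    }


message = (
    "GES: Potential blocker (pid=27388) on resource LF-DC7B39ED-ABA41B1E;\n"
    "enqueue info in file /opt/oracle/trace/10gR2/bdump/v10gR21_lmd0_6665.trc"
    " and DIAG  trace file"
)
print(parse_ges_blocker(message))
```

The extracted PID and trace file path only say where to start looking. The full set of trace files from every node is still needed for the analysis.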

IPC send timed out

An instance reports IPC send timed out when the receiver does not acknowledge messages from the sender. This typically means that the private network used for communication is broken or dropping packets. These messages can also appear when the receiver does not get enough CPU time to acknowledge receipt.

IPC Send timeout detected. Receiver ospid 16582
Fri Sep 23 22:02:43 2008
Errors in file /opt/bdump/10gR2/bdump/v10gR21_lms0_16582.trc:

In the above example, the lms process on Node 1 is reporting a send timeout. The details of the sender, the receiver, and the message itself are captured in the trace files. All the files mentioned earlier are required to understand the root cause.

ORA-29740 evicted by member %s, group incarnation %s

An instance reports an ORA-29740 when the instances cannot communicate with each other, although there may be other reasons for ORA-29740. This error message is more common in 9i than in 10g because, starting with 10g, the clusterware may evict the node if there is a communication error, assuming that it shares the same private network. All the files mentioned earlier are required to understand the cause of the eviction.

Errors in file /opt/bdump/10gR2/bdump/v10gR21_lmon_121396.trc:
ORA-29740: evicted by member 0, group incarnation 18
Mon Dec 8 01:52:25 2007
LMON: terminating instance due to error 29740

ORA-481 LMON process terminated with error

An instance reports an ORA-481 error and crashes when the lmon process dies. One reason an instance can report this error is that the lmon process has run into an ORA-600 error.

Errors in file /opt/bdump/10gR2/bdump//v10gR21_lmon_9944.trc:
ORA-481: LMON process terminated with error
Thu Sep 25 03:46:56 2008
LMON: terminating instance due to error 481

All the traces mentioned above are required to find out why lmon got the error.

ORA-480 LCK* process terminated with error

This error is essentially the same as the one above, except that ORA-480 is reported when the lck process terminates. All the traces mentioned above are required to find out why the lck process terminated.

General Hang and/or Performance issue

Performance Issues
Performance issues are cases where the database is not performing optimally. The customer should try to explain why they believe performance is bad. Common examples are:
a.) Reports took X time to complete yesterday compared to Y time last week.
b.) Inserts/updates/deletes are slow after moving from single instance to RAC.
Some performance issues are side effects of applying an OS patch or of an increase in workload. To understand the cause of a performance issue, it is crucial to collect data such as Statspack/AWR reports as well as OS statistics. It is a good idea to collect both OS and DB statistics before any change, such as applying a patch or adding a new node.

To understand the cause of the performance issue, the following information is crucial:

  1. Clear description of the performance problem along with an explanation of why the customer feels performance is bad.
  2. AWR reports (if the appropriate license is held) or Statspack reports from each node during the performance issue. AWR reports with a 60-minute interval are a good starting point.
  3. AWR reports from each node from a time when performance was acceptable. AWR reports with a 60-minute interval are a good starting point.
  4. OS Watcher or IPD/OS data from each node for the duration of the issue.
  5. Alert.logs from each instance.
  6. Any trace files listed in the alert.log during the performance issue.
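To decide which 60-minute report windows to request, the problem period can be expanded into hourly windows. The Python sketch below is only arithmetic on timestamps: it assumes snapshots are taken at the top of each hour, which should be checked against the actual snapshot times in the database, and the function name hourly_windows is hypothetical.

```python
from datetime import datetime, timedelta


def hourly_windows(problem_start, problem_end):
    """Return (begin, end) pairs of 60-minute windows, aligned to the top of
    the hour, that together cover the period from problem_start to
    problem_end."""
    begin = problem_start.replace(minute=0, second=0, microsecond=0)
    windows = []
    while begin < problem_end:
        windows.append((begin, begin + timedelta(hours=1)))
        begin += timedelta(hours=1)
    return windows


# Example: a problem observed from 22:02:43 to 23:30:00 needs two windows.
for begin, end in hourly_windows(datetime(2008, 9, 23, 22, 2, 43),
                                 datetime(2008, 9, 23, 23, 30, 0)):
    print(begin, "to", end)
```

Shifting the same windows back by a suitable period, for example timedelta(days=7), gives candidate windows for the comparison reports from a time when performance was acceptable.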

Hang Issues
A database or instance hang occurs when a process waits forever. Eventually other processes queue behind the hung process, and soon everything is hung. Oracle RAC has timeouts associated with crucial background processes that cause it to automatically dump diagnostic information when such a process is not responding. Note that in some cases a hang is not really a hang but a severe performance issue in which the DB is so slow that customers may incorrectly perceive it as a hang.

To understand the cause of the hang issue, the following information is crucial:

  1. Clear description of the hang and of how it was detected.
  2. AWR reports (if the appropriate license is held) or Statspack reports from each node for the duration of the hang. AWR reports with a 60-minute interval are a good starting point.
  3. Clusterware logs as per Note 289690.1.
  4. OS Watcher or IPD/OS data from each node.
  5. Alert.logs from each instance.
  6. Any trace files listed in the alert.log during the hang.
