Data Gathering for Troubleshooting RAC Issues

Author: Maclean Liu, posted on September 16th, 2009, English Version

Applies to:

Oracle Server – Standard Edition – Version: 9.2.0.1 to 11.1.0.7 – Release: 9.2 to 11.1
Oracle Server – Enterprise Edition – Version: 9.2.0.1 to 11.1.0.7   [Release: 9.2 to 11.1]
Information in this document applies to any platform.
Oracle Server Enterprise Edition – Version: 9.2.0.1 to 11.1.0.6
This note includes links to information that is helpful to provide up front to Oracle Support when logging your TAR. Depending on the type of problem you are having, Oracle Support may require different types of diagnostic information in order to resolve the issue.

Purpose

This guide documents common RAC issues and identifies the trace files that must be reviewed to understand the cause of each problem.

When gathering and uploading logs from any RAC environment it is highly recommended to gather, zip and upload one (zip) file per node.
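The one-zip-per-node recommendation can be sketched as a small helper that is run on each node. This is only an illustration: the function name bundle_node_logs and the archive name <node>_rac_logs.zip are assumptions, not part of the original note.

```python
import socket
import zipfile
from pathlib import Path


def bundle_node_logs(files, out_dir, node_name=None):
    """Zip the given files into <out_dir>/<node>_rac_logs.zip and return that path."""
    # Assumption: the archive is named after the local host name unless overridden.
    node = node_name or socket.gethostname()
    out_path = Path(out_dir) / f"{node}_rac_logs.zip"
    with zipfile.ZipFile(out_path, "w", compression=zipfile.ZIP_DEFLATED) as archive:
        for path in map(Path, files):
            if path.is_file():  # silently skip files that do not exist on this node
                # Only the base name is stored, so same-named files from different
                # directories would collide inside the archive.
                archive.write(path, arcname=path.name)
    return out_path
```

Naming the archive after the node keeps the uploads from different nodes distinguishable once they are attached to the same request.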

Last Review Date

August 27, 2010

Instructions for the Reader

A Troubleshooting Guide is provided to assist in debugging a specific issue. When possible, diagnostic tools are included in the document to assist in troubleshooting.

Troubleshooting Details

I.  Data Gathering for Typical Errors – What to Gather

This section lists typical errors, indicates what data to gather for each, and briefly explains each issue. Longer explanations of the data gathering requirements and how to gather the data are provided in Section II.

GES potential blocker

  • Standard alert & trace files (Section II A)

An instance reports this error when it is unable to get a resource for a period of time. An example is shown below:

GES: Potential blocker (pid=27388) on resource LF-DC7B39ED-ABA41B1E;
enqueue info in file /opt/oracle/trace/10gR2/bdump/v10gR21_lmd0_6665.trc and DIAG trace file

In the example above, the instance is complaining about PID 27388, which is blocking access to resource LF. To find out why this process is holding the lock, all the trace files mentioned above are required.
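As a rough sketch, the blocker PID, the resource name and the LMD trace file can be pulled out of the alert log text with a regular expression. The pattern is modelled only on the sample message above, and the function name parse_ges_blocker is hypothetical; other releases may word the message differently.

```python
import re

# Assumption: the message has the same shape as the sample in this note.
BLOCKER_RE = re.compile(
    r"GES: Potential blocker \(pid=(?P<pid>\d+)\) on resource (?P<resource>[^;\s]+);"
    r"\s*enqueue info in file (?P<trace>\S+)"
)


def parse_ges_blocker(alert_text):
    """Return pid, resource and trace file of the first blocker message, or None."""
    match = BLOCKER_RE.search(alert_text)
    if match is None:
        return None
    return {
        "pid": int(match.group("pid")),
        "resource": match.group("resource"),
        "trace_file": match.group("trace"),
    }
```

The returned trace file is the one to collect first, together with the DIAG trace file that the message also mentions.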

IPC send timed out

  • Standard alert & trace files (Section II A)
  • OS Messages log (Section II C)
  • OSW or IPD/OS data (Section II E)

An instance reports IPC send timed out when the receiver does not acknowledge messages from the sender. This typically means that the private network used for communication is broken or dropping packets. The same messages can also appear when the receiving process does not get enough CPU time to acknowledge receipt.

IPC Send timeout detected. Receiver ospid 16582
Fri Sep 23 22:02:43 2008
Errors in file /opt/bdump/10gR2/bdump/v10gR21_lms0_16582.trc:

In the example above, the lms on Node 1 is reporting a send timeout. The details of the sender, the receiver and the message itself are captured in the trace files. All the files mentioned earlier are required to understand the root cause.
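A minimal sketch of extracting the receiver ospid and the referenced trace file from such an alert log excerpt is given here. It assumes the message wording of the sample above; the function name parse_ipc_timeout is hypothetical. In the sample, the ospid also appears as the numeric suffix of the trace file name, and the sketch checks for that, but this should be treated as an observation about the sample rather than a guarantee.

```python
import re

# Assumption: wording as in the sample alert log excerpt in this note.
RECEIVER_RE = re.compile(r"IPC Send timeout detected\.\s*Receiver ospid (\d+)")
TRACE_RE = re.compile(r"Errors in file (\S+\.trc)")


def parse_ipc_timeout(alert_text):
    """Return receiver ospid and the first trace file named after the message."""
    receiver = RECEIVER_RE.search(alert_text)
    if receiver is None:
        return None
    ospid = receiver.group(1)
    # Only look at text that follows the timeout message.
    trace = TRACE_RE.search(alert_text, receiver.end())
    trace_file = trace.group(1) if trace else None
    return {
        "receiver_ospid": int(ospid),
        "trace_file": trace_file,
        # True when the trace file name ends in _<ospid>.trc, as in the sample.
        "trace_belongs_to_receiver": bool(trace_file) and trace_file.endswith(f"_{ospid}.trc"),
    }
```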

Instance eviction / ORA-29740 “evicted by member %s, group incarnation %s”

  • Standard alert & trace files (Section II A)
  • CRS logs (Section II B)
  • OS messages logs (Section II C)

An instance reports an ORA-29740 when the instances cannot communicate with each other. There may be other reasons for ORA-29740. This error message is more common in 9i than in 10g because, starting with 10g, the clusterware may evict the node when there is a communication error, assuming that it shares the same private network. All the files mentioned earlier are required to understand the cause of the eviction.

Errors in file /opt/bdump/10gR2/bdump/v10gR21_lmon_121396.trc:
ORA-29740: evicted by member 0, group incarnation 18
Mon Dec 8 01:52:25 2007
LMON: terminating instance due to error 29740

ORA-481 LMON process terminated with error

  • Standard alert & trace files (Section II A)

An instance reports an ORA-481 error and crashes when the LMON process dies. One reason an instance can report this error is that the LMON process ran into an ORA-600 error.

Errors in file /opt/bdump/10gR2/bdump//v10gR21_lmon_9944.trc:
ORA-481: LMON process terminated with error
Thu Sep 25 03:46:56 2008
LMON: terminating instance due to error 481

ORA-480 LCK* process terminated with error

  • Standard alert & trace files (Section II A)

This error message is essentially the same as ORA-481 except that the 480 is reported when the lck process terminates.

ORA-600 or ORA-7445

  • Standard alert & trace files (Section II A)

See also:  Note 146581.1 How to deal with ORA-600 Internal Errors and Note 211909.1 Customer Introduction to ORA-7445 Errors

Other Instance Crashes, Process Crashes

  • Standard alert & trace files (Section II A)

Hangs, Deadlocks, and Process Spins

  • Clear description of problem
  • During the problem, you should gather systemstate and hanganalyze (Section II D)
  • Time the problem started/stopped and how it was resolved (if it was)
  • AWR/Statspack from problem time (Section II G)
  • OS Watcher or IPD/OS data from each node for the duration of the hang. (Section II E)
  • Standard alert & trace files (Section II A)
  • Any trace files printed in the alert.log during the performance hang.

General Hang and/or Performance issue

  • Clear description of problem
  • During the problem, you should gather systemstate and hanganalyze (Section II D)
  • Time the problem started/stopped and how it was resolved (if it was)
  • AWR/Statspack from problem time (Section II G)
  • OS Watcher or IPD/OS data from each node for the duration of the hang. (Section II E)
  • Standard alert & trace files (Section II A)
  • Any trace files printed in the alert.log during the performance hang.

Performance Issues

Performance issues are cases where the database is not performing optimally. The customer should try to explain why they believe performance is bad. Common examples are:

a.) Reports completed in X time yesterday compared to last week
b.) Insert/update/deletes are slow after moving from Single instance to RAC.

Some performance issues are side effects of applying an OS patch or of an increase in workload. To understand the cause of a performance issue, it is crucial to collect data such as Statspack/AWR reports, along with OS statistics. It is a good idea to collect both OS and DB statistics before any change, such as applying a patch or adding a new node.

Hang Issues

A database or instance hang is caused when a process is waiting forever. Eventually other processes queue behind this hung process and soon everything is hung. Oracle RAC has timeouts associated with crucial background processes that cause it to dump diagnostic information automatically when a process is not responding. Note that in some cases a hang is not really a hang but a severe performance problem, where the DB is so slow that customers may incorrectly perceive it to be hung.

If you are experiencing a hang, you should gather hanganalyze and systemstate dumps, or run racdiag.sql, which gathers these for you. This must be done DURING the hang. In many cases, you can gather this information even if you are unable to log in to the database. Instructions are in Section II D below.

II.  Data Gathering for All CRS and RAC Issues – How to Gather It

A.  Standard alert & trace files:

  • Alert.log from each node
  • LMS[0-9] trace files from each node
  • LCK trace file from each node
  • LMON trace file from each node
  • DIAG trace file from each node
  • Any trace file documented in the alert.log at the time of the issue.
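The list above can be turned into a rough file selector, sketched below. It assumes that trace files are named <instance>_<process>_<ospid>.trc, as in the samples in Section I (for example v10gR21_lmon_9944.trc), and that the alert log is named alert_<instance>.log. The helper names standard_files and traces_named_in_alert are hypothetical.

```python
import fnmatch
import re
from pathlib import Path

# Assumption: file naming follows the samples shown in Section I of this note.
STANDARD_PATTERNS = [
    "alert_*.log",
    "*_lms[0-9]_*.trc",
    "*_lck*_*.trc",
    "*_lmon_*.trc",
    "*_diag_*.trc",
]


def standard_files(dump_dir):
    """Return the sorted file names in dump_dir that match the standard patterns."""
    names = sorted(p.name for p in Path(dump_dir).iterdir() if p.is_file())
    return [
        name
        for name in names
        if any(fnmatch.fnmatchcase(name, pattern) for pattern in STANDARD_PATTERNS)
    ]


def traces_named_in_alert(alert_text):
    """Return the distinct trace files named in 'Errors in file ...' lines."""
    return sorted(set(re.findall(r"Errors in file (\S+\.trc)", alert_text)))
```

The second helper covers the last bullet: any trace file that the alert log names around the time of the issue, including ones from processes that the fixed patterns do not match.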

B.  CRS logs:

  • Please follow the instructions in Step 4 of Note 330358.1 to run “diagcollection.pl”, and upload the resulting files to the SR.

C.  OS messages logs:

  • AIX:  Please upload the output of the “errpt” command and the “errpt -a” command, from each node.
  • Linux:  /var/log/messages
  • Solaris:  /var/adm/messages
  • HP-UX:  /var/adm/syslog/syslog.log
  • Tru64:  /var/adm/messages
  • Windows:  Save Application Log and System Log as txt files Using Event Viewer

D.  RACDiag:

  • Note 135714.1 Script to Collect RAC Diagnostic Information (racdiag.sql)
  • If the instance is hanging and you can’t log in to any instance, please do one of these 2 options instead:

1) Use “sqlplus -prelim / as sysdba” to get into sqlplus on one instance, and run the oradebug statements 2-3 times – 1 minute apart:

SQL> oradebug setmypid
SQL> oradebug unlimit
SQL> oradebug -g all hanganalyze 3
SQL> oradebug -g all dump systemstate 267

This only needs to be done on one instance; the “-g all” will kick off collection on all the other instances.
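The repeated collection in option 1 could be scripted roughly as follows. This is a sketch, not a supported tool: it assumes sqlplus is on the PATH with the instance environment already set, and the function names build_script and collect_hang_dumps are hypothetical. The oradebug statements are exactly the ones listed above.

```python
import subprocess
import time

# The oradebug statements from this note, run in one sqlplus session per round.
ORADEBUG_STATEMENTS = [
    "oradebug setmypid",
    "oradebug unlimit",
    "oradebug -g all hanganalyze 3",
    "oradebug -g all dump systemstate 267",
]


def build_script():
    """Return the text that is fed to sqlplus on standard input."""
    return "\n".join(ORADEBUG_STATEMENTS + ["exit"]) + "\n"


def collect_hang_dumps(rounds=3, pause_seconds=60, run=subprocess.run, sleep=time.sleep):
    """Run the oradebug statements `rounds` times, pausing between rounds."""
    for round_number in range(rounds):
        # Equivalent to: sqlplus -prelim "/ as sysdba"
        run(["sqlplus", "-prelim", "/ as sysdba"], input=build_script(), text=True, check=True)
        if round_number < rounds - 1:
            sleep(pause_seconds)
```

The run and sleep parameters exist only so the loop can be exercised without a database; calling collect_hang_dumps() with the defaults performs three rounds one minute apart.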

-or-

2) Capture systemstate dumps from each node with an OS debugger (gdb):

ps -ef |grep diag
gdb $ORACLE_HOME/bin/oracle <diag pid>
print ksudss(266)

This must be done on each node.
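Finding the DIAG process ID from the ps output can be sketched as below. It assumes the background process appears in `ps -ef` as ora_diag_<SID> and that the PID is the second column; the function name find_diag_pids is hypothetical.

```python
import re

# Assumption: the DIAG background process is listed as ora_diag_<SID>.
DIAG_RE = re.compile(r"\bora_diag_(\S+)\s*$")


def find_diag_pids(ps_output):
    """Map each SID to the PID of its DIAG process, given `ps -ef` output."""
    pids = {}
    for line in ps_output.splitlines():
        match = DIAG_RE.search(line)
        fields = line.split()
        # The PID is taken from the second column of `ps -ef`.
        if match and len(fields) > 1 and fields[1].isdigit():
            pids[match.group(1)] = int(fields[1])
    return pids
```

Matching on the process name rather than on the word diag avoids picking up the grep command itself, which also shows up in the output of `ps -ef |grep diag`.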

E.  OS Watcher or Cluster Health Monitor (IPD/OS):

OSW:
Cluster Health Monitor (IPD/OS) Windows:
  • Note 810915.1 How to Monitor, Detect and Analyze OS and RAC Resource Related Degradation and Failures on Windows
  • Note 811151.1 How to install Oracle Cluster Health Monitor (former IPD/OS) on Windows
Cluster Health Monitor (IPD/OS) Linux:

F.  RDA:

G.  Database Performance Data From All Nodes (AWR/Statspack):

Please gather AWR and ADDM reports (or Statspack, if not licensed to use AWR/ADDM or if version < 10.1):

  • From ALL nodes
  • Each report should have a 1 hour duration
  • Get several consecutive reports (e.g. 9am-10am, 10am-11am, 11am-12pm) from EACH node
  • If the problem occurs only at a specific time:
      • Get reports starting within 2 hours before the problem
  • Also get a report from when the system was healthy.

Instructions for gathering these reports are here:
  • Note 94224.1 FAQ- Statspack Complete Reference
  • Note 276103.1 Performance Tuning Using 10g Advisors and Manageability Features (10g).  From this note, please also gather the information from the section “What Oracle Support needs to diagnose a performance problem in Oracle 10g”
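The report windows described in this section can be worked out with a few lines of date arithmetic. The sketch below is one reading of the guidance, not an Oracle rule: it rounds the problem time down to the hour, steps back two hours, and returns consecutive one-hour windows from there. The function name report_windows and its parameters are hypothetical.

```python
from datetime import datetime, timedelta


def report_windows(problem_start, lead_hours=2, count=3):
    """Return `count` consecutive (begin, end) one-hour windows.

    Assumption: windows are aligned to whole hours and the first one begins
    `lead_hours` before the hour in which the problem started.
    """
    first = problem_start.replace(minute=0, second=0, microsecond=0) - timedelta(hours=lead_hours)
    return [
        (first + timedelta(hours=i), first + timedelta(hours=i + 1))
        for i in range(count)
    ]
```

For a problem that started at 11:20, this yields the windows 9:00-10:00, 10:00-11:00 and 11:00-12:00, which would then be requested from every node.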

