Corruption: Information needed for Root Cause Analysis (RCA) ** INTERNAL ONLY **

作者: Maclean Liu , post on July 12th, 2010 , English Version
【本站文章除注明转载外,均为本站原创编译】
转载请注明:文章转载自: Oracle Clinic – Maclean Liu的个人技术博客 [http://www.oracledatabase12g.com/]
本文标题: Corruption: Information needed for Root Cause Analysis (RCA) ** INTERNAL ONLY **
本文永久地址: http://www.oracledatabase12g.com/archives/corruption-information-needed-for-root-cause-analysis-rca-internal-only.html

PURPOSE
=======
Often times after a corruption is encountered and even worked around, client’s
management start asking questions on how the corruption occurred and what they
can do to prevent it.

The purpose of this document is to guide the reader through the steps of
gathering the minimum amount of information needed for Oracle to do Root Cause
Analysis (RCA) that will help address customer’s concerns for those corruption
cases.

It is EXTREMELY important and a good practice to gather ALL the information
needed on the FIRST corruption occurrence even if client is not asking for RCA
and just wants to workaround the problem.  Often times, the same customer will
ask for the RCA at a later time!

SCOPE & APPLICATION
===================
The article is intended for Oracle Support Analysts.

CORRUPTION OVERVIEW
===================
Each block of an Oracle data file is formatted with a fixed header that
contains information about the particular block.  This information provides a
means to ensure the integrity for each block and in turn, the entire Oracle
database.

When Oracle reads a block from disk into the Buffer Cache or writes a block
from the Buffer Cache to disk, there are certain checks done on the block.
For example, one of the checks is to compare the lower order 2 bytes of SCNBase
(byte 10 & 11), the block Type (byte 0), and the Seq number (byte 14) and see
if it matches with the Tail (Byte C, D, E, F).

Oracle8 data block layout:
~~~~~~~~~~~~~~~~~~~~~~~~~

|   |   |   |   |   |   |   |   |   |   |   |   |   |   |   |   |
| 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 |10 |11 |12 |13 |14 |15 |

+——-+——-+—————+—————+—————+
|Typ|Fmt|Filler |     RDBA      |    SCNBase    |SCNWrap|Seq|Flg|
|——-+——-+—————+—————+—————|
|ChkVal |Filler | <——                                       |
|——-+——-+                                               |
:                       Body of                                 :
:                        Data Block                             :
:                                                               :
|                                               +—————|
|                                    ———> |     Tail      |
+—————————————————————+

| 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | A | B | C | D | E | F |
|   |   |   |   |   |   |   |   |   |   |   |   |   |   |   |   |

There are various types of corruption that can occur in an Oracle environment:

o  Physical Corruption:

Can be defined as a damage to the internal data structure in a
way that prevents Oracle from being able to read the data in the
corrupted block/s.

Physical corruption can be due to a hardware or software and can
occur in MEMORY — including various I/O buffers and Oracle Buffer
Cache — or on DISK.

+       Memory Corruption – Memory corruption is usually identified by
a background or shadow process whenever it tries to access the
part of memory that is corrupted. For example, when a shadow
process tries to read or update a block in the Oracle Buffer
Cache OR when the DBWR background process tries to write
dirtied block from the Cache to Disk.

Common causes include:

- OS bugs causing bad reads or bad writes.
- Hardware issues.
- Oracle Bug.
- Non-Oracle program attaching and illegally writing
to the same shared memory address space.

+       Disk/Media Corruption – Media failures are failures that are
caused due to a hardware problems, operating system problems,
controller problems, logical volume manager problems, etc.
The data in the Media corrupted block will be lost.

o  Logical Corruption:

Involves Oracle being able to find the data, but the data values
are incorrect.

For the purpose of this document our only concern is Physical Corruption.

CORRUPTION ERRORS:
=================

Following are some corruption related ERRORS or Messages that you might be
dealing with:

I.    Getting an ORA-01578 :

ORA-01578: “ORACLE data block corrupted (file # %s, block # %s)”
Cause: The data block indicated was corrupted, mostly due to
software errors.
Action: Try to restore the segment containing the block indicated.
This may involve dropping the segment and recreating it.
If there is a trace file, report the errors in it to your
ORACLE representative.

II.    Getting an ORA-00600 with first argument in the range of 2000-8000

Code Layer        Range
—————–    ———–
Cache Layer         2000 -  4000
Transaction Layer     4000 -  6000
Data  Layer         6000 -  8000
Generic Layer        17000 – 18000  — Heap Corruption.

IV.    You found a trace files similar to this:

Corrupt block dba: 0x160c5958 . found during
buffer read on disk type:6. ver:1. dba: 0x160c5958 inc:0x0000da78
seq:0x000000fb incseq:0xda7800fc
Reread of block=160c5958 file=22. blocknum=809304. found same
corrupted data

V.    You ran ‘dbv’ against a file and you got messages similar to this:

Example_one:

DBVERIFY – Verification starting : FILE = /orad01_A/b91prd02/system01.dbf
kdbchk: the amount of space used is not equal to block size
used=213 fsc=12 avsp=7870 dtl=8096
Page 1147 failed with check code 10
DBVERIFY – Verification complete

Example_two:

DBVERIFY – Verification starting : FILE = /oracle/datafile01.dbf
kdbchk: a row ends in the middle of another
tab=0 slot=10 begin=0x2a6 len=0x15e
Page 56167 failed with check code 5
DBVERIFY – Verification complete

o    You analyzed an object and failed with :

ORA-01498 : “block check failure – see trace file”

OR

ORA-01499 : “table/index cross reference failure – see trace file”

BEFORE YOU WORKAROUND THE CORRUPTION
====================================

I suggest that you gather the following information before you attempt to work
around the corruption problem:

o    The database Alert log.

o    The database init<SID>.ora

o    Oracle trace files generated around the corruption time.

o    Operating System’s system logs.

o    Logical Volume Manager (LVM) logs.

o    Thorough history of events that led to the corruption.

o    Find out if there are any noticeable changes to the environment such
as new init<SID>.ora parameters, new code, Oracle patches, OS patches,
Oracle upgrade, and OS upgrade.

o    DBVERIFY on the corrupted file (assuming that the corruption
was not detected by running DBVERIFY in the first place):

% dbv file=<corrupted_file_name> blocksize=<oracle_block_size_in_bytes>

Example:

% dbv file=/orad01_A/b91prd02/system01.dbf blocksize=4096

DBVERIFY – Verification starting : FILE = /orad01_A/b91prd02/system01.dbf
kdbchk: the amount of space used is not equal to block size
used=213 fsc=12 avsp=7870 dtl=8096
Page 1147 failed with check code 10

Total Pages Examined         : 6400
Total Pages Processed (Data) : 1690
Total Pages Failing   (Data) : 1
Total Pages Processed (Index): 709
Total Pages Failing   (Index): 0
Total Pages Empty            : 2969
Total Pages Marked Corrupt   : 0
Total Pages Influx           : 0

o    ANALYZE on the object reporting the corruption (assuming that the
corruption was not detected by running the ANALYZE command) :

1_  Identify the corrupted object:

First, identify the FILE# and the BLOCK# of the
corrupted block:

a_  If ORA-01578 :

File# and Block# is part of the error message.

b_  If ORA-00600 :

Find if one of the ORA-00600 arguments represent
a Data Block Address (DBA) in Oracle7 or the
Relative Data Block Address (RDBA) for Oracle8 :

++ If it does :

Example:

ORA-00600 [3339] [805306632] [805306633]     (Oracle7)

Plug in the DBA/RDBA decimal value in WebIV –
DBA Calculations Utility — to get the corresponding
File# and Block#.

In this example argument 2 and 3 correspond to
DBA addresses .. lets take DBA 805306633 for an
example. If you plug in the DBA decimal value in
the DBA Calculations Utility you will get the
following results.

Converting DBA 805306633 (dec) ==>

DBA          (Hex)     =  File#, Block#
805306633 (0×30000109) =     12, 265

So File#  = 12
Block# = 265

++ If it doesn’t:

Find out if the ORA-00600 trace file has an Oracle
Block Dump in it .. you can search the trace
file for ‘Block header’.

Example of dump:

*** SESSION ID:(13.1324) 2000.04.11.11.57.04.036
Block header dump: dba: 0×30000109
Object id on Block? Y
seg/obj: 0xf92  csc: 0×00.32f3b9a1  itc: 2  flg: -
typ: 2 – INDEX fsl: 0  fnx: 0×0
….

Plug in the DBA/RDBA hex value in WebIV — DBA
Calculations Utility — to get the corresponding
File# and Block#.

In this example the DBA == 0×30000109 (hex),
plugging in this value in DBA Calculations Utility
we get the corresponding File#, Block# as follows:

Converting 0×30000109 (hex) ==>

DBA          (Hex)     =  File#, Block#
805306633 (0×30000109) =     12, 265

So File#  = 12
Block# = 265

c_  If you get a trace similar to the one listed under
CORRUPTION ERRORS — number IV — above:

File# and Block# are part of the trace — assuming
that the DBA itself is not corrupted!

Ex.

Corrupt block dba: 0x160c5958 . found during
buffer read on disk type:6. ver:1. dba: 0x160c5958
inc:0x0000da78 seq:0x000000fb incseq:0xda7800fc
Reread of block=160c5958 file=22. blocknum=809304. found same
corrupted data

Here the File#     = 22
Block# = 809304

d_  If the corruption was detected by running DBVERIFY:

File# is the corresponding file# of the file
you ran dbverify against.

select name, file# from v$datafile;

Block# is the Page# that failed from running
dbverify against the file.

Example:

DBVERIFY – Verification starting : FILE = system01.dbf
kdbchk: the amount of space used is not equal to block
size used=213 fsc=12 avsp=7870 dtl=8096
Page 1147 failed with check code 10
DBVERIFY – Verification complete

SVRMGR> select name, file# from v$datafile;
NAME                 FILE#
—————————     —-
/oracle/system01.dbf         1

File# for system01.dbf     ==>  1
Block# = failed Page#     ==>  1147

Once you identify the File# and Block#, you need to run the
following SQL statement to identify the corrupted object:

SELECT     owner, segment_name, segment_type
FROM     dba_extents
WHERE     file_id = <File#>
AND     <Block#> BETWEEN block_id AND block_id + blocks – 1;

2_  Analyze the object:

IF segment_type …

INDEX:
—–

ANALYZE    INDEX <index> VALIDATE STRUCTURE;

DATA:
—–

ANALYZE TABLE <table> VALIDATE STRUCTURE;

or

ANALYZE TABLE <table> VALIDATE STRUCTURE CASCADE;

Note down the errors and get all the trace files generated
from the analyze command/s.

o    Dump the corrupted block/s at the ORACLE level — assuming
that the trace files generated with the corruption error
does not have that dump already:

For detailed information on how to get a block dump at the
ORACLE level, please reference:

Note 45852.1  (Oracle8):
————————-

Example:

ALTER SYSTEM DUMP DATAFILE {‘filename’}|{filenumber}

|—BLOCK MIN {blockno} BLOCK MAX {blockno}|–>
|                                          |
|—-BLOCK {blockno}———————–|

Eg:
ALTER SYSTEM DUMP DATAFILE 1 BLOCK 5586;
ALTER SYSTEM DUMP DATAFILE 1 BLOCK MIN 5585 BLOCK MAX 5586;
ALTER SYSTEM DUMP DATAFILE ‘/u01/oradata/MYDB/system01.dbf’ BLOCK 98;

Note 43142.1  (Oracle7):
————————-

Example:
alter session set events ‘immediate trace name BLOCKDUMP level <DBA>’;

Where the DBA value is in DECIMAL.

o    Dump the corrupted block/s at the OS level:

For detailed information on how to get a block dumps at the
OS level, please reference Note 43142.1.

If data files are on ASM, please review the following notes:

Note 293095.1
Note 387001.1
Note 294727.1

EXAMPLES:
========

UNIX:
—-

% dd if= <data_file> bs= <oracle_block_size> skip=<BlockNo> \
count= <how_many_blocks> of=<output file>

where

if:        For Input File — the file we are dumping from.
of:           For Output File  — the file we are dumping to
bs:        Oracle block size.
skip:     number of blocks to skip == the block# we are trying to
dump.  It is useful to dump the block BEFORE and
AFTER the one required.
Eg: skip=Block# -1  count=3

count:    The number of block to dump.

Ex.

The following command dumps the BEFORE block, block#2345, and the
AFTER block in file ‘dbfile.dbf’ :

% dd if= dbfile.dbf bs=2k skip=2344 count=3 of=blocks.dd

VMS:

Dump  the block at the OS level in VMS:

$ DUMP/BLOCKS=(start:XXXX,end:YYYY)/out=dump.out <data-file>

Where  XXXX or YYYY =
Operating system block number (in 512 byte chunks).  To calculate these
numbers multiply the block number you are dumping  by
‘DB_BLOCK_SIZE/512′.

Ex.

The following command dumps the BEFORE block#231822, the
block#231823, and the AFTER block 231824 in file ‘dbfile.dbf’
assuming the Oracle data block size is 2k :

$ DUMP/BLOCKS=(start:927288,end:927296)/out=dump.out dbfile.dbf

927288 = 231822 * (DB_BLOCK_SIZE/512) = 231822 * (2k/512)
927296 = 231824 * (DB_BLOCK_SIZE/512) = 231824 * (2k/512)

o    If the database is running in ARCHIVE mode:

Dump the corrupted block from the ONLINE + ARCHIVE redo
logs.  It is hard to determine how far back in time we
need to dump the redo info for the corrupted block unless
you know a point in time when the block was *not* corrupted.
Potentially we might need all the redo dump for the block
back to that time.

Please reference Note 28989.1 for more detail
instruction on how to dump the redo info.

Example:

ALTER SYSTEM dump logfile ‘filename’
dba min <filno> . <blkno>
dba max <filno> . <blkno>;

Where ‘filename’ is the ONLINE or ARCHIVE redo log we are
currently dumping the block info from.

Note: There must be a space each side of the ‘.’ in the DBA

Ex.

The following command will dump block#100 of file#14
from the online redo log, log2a.dbf.

ALTER SYSTEM dump logfile ‘log4b.dbf’
dba min 14 . 100
dba max 14 . 100;

(NOTE:  You need to run the above alter command *multiple*
times — each time with different online or archive
log (logfile) depending from which redo log you are
dumping the block from. )

A trace file will be generated in the USER_DUMP_DEST directory.
Following is an example of how the beginning of the dump output
will look like:

*** SESSION ID:(9.32) 2000.05.24.15.02.30.000
DUMP OF REDO FROM FILE ‘/database/805/V805/log4b.rdo’
Opcodes *.*
DBA’s: (file # 14, block # 100) thru (file # 14, block # 100)
RBA’s: 0×000000.00000000.0000 thru 0xffffffff.ffffffff.ffff
SCN’s scn: 0×0000.00000000 thru scn: 0xffff.ffffffff
….

After you take the redo dumps and if it is all possible,
have client restore the corrupted file from a backup from
time were he/she thinks the block was not corrupted –
this could be verified by running DBVERIFY on the restored
file — and apply recovery on the file to bring it in
sync with the rest of the database files.

In addition to this simple recovery procedure of having the
potential of working around the problem, it will be a valuable
test for our root cause analysis process to find out whether
restoring the file and recovering it will re-introduces the
corruption problem or not?  Please make sure that you start
with a clean copy of the file, one where the block was not
corrupted.  Again, running a simple DBVERIFY test on the
restored file will identify whether the file is corrupted
or not!

REFERENCES:
==========

o    ”Prevention, Detection and Repair of Database Corruption”
COE Technical Report — Janani Mahalingam

© 2010, www.oracledatabase12g.com. 版权所有.文章允许转载,但必须以链接方式注明源地址,否则追究法律责任.

相关文章 | Related posts:

  1. Collecting Diagnostic information for Oracle block corruption
  2. Oracle Block Cache Layer And Its Corresponding Corruption Error Messages.
  3. FAQ Memory Corruption
  4. 9iRAC Tuning Best Practices (INTERNAL ONLY)
  5. How to Force the Database Open ** INTERNAL ONLY **
  6. ORA-00600 [KCBZPB_1], [59033077], [4], [1], [] example
  7. Script to Collect RAC Diagnostic Information (racdiag.sql)
  8. 11g New Feature: Health monitor

Leave a Reply

  

  

  

You can use these HTML tags

<a href="" title=""> <abbr title=""> <acronym title=""> <b> <blockquote cite=""> <cite> <code> <del datetime=""> <em> <i> <q cite=""> <strike> <strong>