了解你所不知道的SMON功能(五):Recover Dead transaction

SMON的作用还包括清理死事务:Recover Dead transaction。当服务进程在提交事务(commit)前就意外终止的话会形成死事务(dead transaction),PMON进程负责轮询Oracle进程,找出这类意外终止的死进程(dead process),通知SMON将与该dead process相关的dead transaction回滚清理,并且PMON还负责恢复dead process原本持有的锁和latch。

我们来具体了解dead transaction的恢复过程:

SQL> select * from v$version;

BANNER
----------------------------------------------------------------
Oracle Database 10g Enterprise Edition Release 10.2.0.4.0 - 64bi
PL/SQL Release 10.2.0.4.0 - Production
CORE    10.2.0.4.0      Production
TNS for Linux: Version 10.2.0.4.0 - Production
NLSRTL Version 10.2.0.4.0 - Production

SQL> select  * from global_name;

GLOBAL_NAME
--------------------------------------------------------------------------------
www.oracledatabase12g.com

SQL>alter system set fast_start_parallel_rollback=false;
System altered.

设置10500,10046事件以跟踪SMON进程的行为

SQL> alter system set events '10500 trace name context forever,level 8';
System altered.

SQL> oradebug setospid 4424
Oracle pid: 8, Unix process pid: 4424, image: oracle@rh2.oracle.com (SMON)

SQL> oradebug event 10046 trace name context forever,level 8;
Statement processed.

在一个新的terminal中执行大批量的删除语句,在执行一段时间后使用操作系统命令将执行该删除操作的
服务进程kill掉,模拟一个大的dead transaction的场景

SQL> delete large_rb;
delete large_rb

[oracle@rh2 bdump]$ kill -9 4535

等待几秒后pmon进程会找出dead process:
[claim lock for dead process][lp 0x7000003c70ceff0][p 0x7000003ca63dad8.1290666][hist x9a514951]

在x$ktube内部视图中出现ktuxecfl(Transaction flags)标记为DEAD的记录:

SQL> select sum(distinct(ktuxesiz)) from x$ktuxe where ktuxecfl = 'DEAD';

SUM(DISTINCT(KTUXESIZ))
-----------------------
                  29386

SQL> /

SUM(DISTINCT(KTUXESIZ))
-----------------------
                  28816

以上KTUXESIZ代表事务所使用的undo块总数(number of undo blocks used by the transaction)

==================smon trace content==================
SMON: system monitor process posted
WAIT #0: nam='log file switch completion' ela= 0 p1=0 p2=0 p3=0 obj#=1 tim=1278243332801935
WAIT #0: nam='log file switch completion' ela= 0 p1=0 p2=0 p3=0 obj#=1 tim=1278243332815568
WAIT #0: nam='latch: row cache objects' ela= 95 address=2979418792 number=200 tries=1 obj#=1 tim=1278243333332734
WAIT #0: nam='latch: row cache objects' ela= 83 address=2979418792 number=200 tries=1 obj#=1 tim=1278243333356173
WAIT #0: nam='latch: undo global data' ela= 104 address=3066991984 number=187 tries=1 obj#=1 tim=1278243347987705
WAIT #0: nam='latch: object queue header operation' ela= 89 address=3094817048 number=131 tries=0 obj#=1 tim=1278243362468042
WAIT #0: nam='log file switch (checkpoint incomplete)' ela= 0 p1=0 p2=0 p3=0 obj#=1 tim=1278243419588202
Dead transaction 0x00c2.008.0000006d recovered by SMON

=====================
PARSING IN CURSOR #3 len=358 dep=1 uid=0 oct=3 lid=0 tim=1278243423594568 hv=3186851936 ad='ae82c1b8'
select smontabv.cnt,
       smontab.time_mp,
       smontab.scn,
       smontab.num_mappings,
       smontab.tim_scn_map,
       smontab.orig_thread
  from smon_scn_time smontab,
       (select max(scn) scnmax,
               count(*) + sum(NVL2(TIM_SCN_MAP, NUM_MAPPINGS, 0)) cnt
          from smon_scn_time
         where thread = 0) smontabv
 where smontab.scn = smontabv.scnmax
   and thread = 0

END OF STMT
PARSE #3:c=0,e=1354526,p=0,cr=0,cu=0,mis=1,r=0,dep=1,og=4,tim=1278243423594556
EXEC #3:c=0,e=106,p=0,cr=0,cu=0,mis=0,r=0,dep=1,og=4,tim=1278243423603269
FETCH #3:c=0,e=47065,p=0,cr=319,cu=0,mis=0,r=1,dep=1,og=4,tim=1278243423650375
*** 2011-06-24 21:19:25.899
WAIT #0: nam='smon timer' ela= 299999999 sleep time=300 failed=0 p3=0 obj#=1 tim=1278243716699171
kglScanDependencyHandles4Unpin():
  cumscan=3 cumupin=4 time=776 upinned=0

以上SMON回滚清理Dead transaction的过程从”system monitor process posted”开始到”Dead transaction 0x00c2.008.0000006d recovered by SMON”结束。另外可以看到在恢复过程中SMON先后请求了’latch: row cache objects’、’latch: undo global data’、’latch: object queue header operation’三种不同类型的latch。

现象

fast_start_parallel_rollback参数决定了SMON在回滚事务时使用的并行度,若将该参数设置为false那么并行回滚将被禁用,若设置为Low(默认值)那么会以2*CPU_COUNT数目的并行度回滚,当设置为High则4*CPU_COUNT数目的回滚进程将参与进来。当我们通过以下查询发现系统中存在大的dead tranacation需要回滚时我们可以通过设置fast_start_parallel_rollback为HIGH来加速恢复:

select sum(distinct(ktuxesiz)) from x$ktuxe where ktuxecfl = 'DEAD';

==============parallel transaction recovery===============

*** 2011-06-24 20:31:01.765
SMON: system monitor process posted msgflag:0x0000 (-/-/-/-/-/-/-)

*** 2011-06-24 20:31:01.765
SMON: process sort segment requests begin

*** 2011-06-24 20:31:01.765
SMON: process sort segment requests end

*** 2011-06-24 20:31:01.765
SMON: parallel transaction recovery begin
WAIT #0: nam='DFS lock handle' ela= 504 type|mode=1413545989 id1=3 id2=11 obj#=2 tim=1308918661765715
WAIT #0: nam='DFS lock handle' ela= 346 type|mode=1413545989 id1=3 id2=12 obj#=2 tim=1308918661766135
WAIT #0: nam='DFS lock handle' ela= 565 type|mode=1413545989 id1=3 id2=13 obj#=2 tim=1308918661766758
WAIT #0: nam='DFS lock handle' ela= 409 type|mode=1413545989 id1=3 id2=14 obj#=2 tim=1308918661767221
WAIT #0: nam='DFS lock handle' ela= 332 type|mode=1413545989 id1=3 id2=15 obj#=2 tim=1308918661767746
WAIT #0: nam='DFS lock handle' ela= 316 type|mode=1413545989 id1=3 id2=16 obj#=2 tim=1308918661768146
WAIT #0: nam='DFS lock handle' ela= 349 type|mode=1413545989 id1=3 id2=17 obj#=2 tim=1308918661768549
WAIT #0: nam='DFS lock handle' ela= 258 type|mode=1413545989 id1=3 id2=18 obj#=2 tim=1308918661768858
WAIT #0: nam='DFS lock handle' ela= 310 type|mode=1413545989 id1=3 id2=19 obj#=2 tim=1308918661769224
WAIT #0: nam='DFS lock handle' ela= 281 type|mode=1413545989 id1=3 id2=20 obj#=2 tim=1308918661769555

*** 2011-06-24 20:31:01.769
SMON: parallel transaction recovery end

但是在real world的实践中可以发现当fast_start_parallel_rollback= Low/High,即启用并行回滚时常有并行进程因为各种资源互相阻塞导致回滚工作停滞的例子,当遭遇到这种问题时将fast_start_parallel_rollback设置为FALSE一般可以保证恢复工作以串行形式在较长时间内完成。

如何禁止SMON Recover Dead transaction

可以设置10513事件来临时禁止SMON恢复死事务,这在我们做某些异常恢复的时候显得异常有效,当然不建议在一个正常的生产环境中设置这个事件:

SQL> alter system set events '10513 trace name context forever, level 2';

System altered.

10531 -- event disables transaction recovery which was initiated by SMON

SQL> select ktuxeusn,
  2         to_char(sysdate, 'DD-MON-YYYY HH24:MI:SS') "Time",
  3         ktuxesiz,
  4         ktuxesta
  5    from x$ktuxe
  6   where ktuxecfl = 'DEAD';

  KTUXEUSN Time                         KTUXESIZ KTUXESTA
---------- -------------------------- ---------- ----------------
        17 24-JUN-2011 22:03:10                0 INACTIVE
        66 24-JUN-2011 22:03:10                0 INACTIVE
       105 24-JUN-2011 22:03:10                0 INACTIVE
       193 24-JUN-2011 22:03:10            33361 ACTIVE
       194 24-JUN-2011 22:03:10                0 INACTIVE
       194 24-JUN-2011 22:03:10                0 INACTIVE
       197 24-JUN-2011 22:03:10            20171 ACTIVE

7 rows selected.

SQL> /

  KTUXEUSN Time                         KTUXESIZ KTUXESTA
---------- -------------------------- ---------- ----------------
        17 24-JUN-2011 22:03:10                0 INACTIVE
        66 24-JUN-2011 22:03:10                0 INACTIVE
       105 24-JUN-2011 22:03:10                0 INACTIVE
       193 24-JUN-2011 22:03:10            33361 ACTIVE
       194 24-JUN-2011 22:03:10                0 INACTIVE
       194 24-JUN-2011 22:03:10                0 INACTIVE
       197 24-JUN-2011 22:03:10            20171 ACTIVE

7 rows selected.

================smon disabled trans recover trace==================

SMON: system monitor process posted
*** 2011-06-24 22:02:57.980
SMON: Event 10513 is level 2, trans recovery disabled.
  1. 您好,我有一个问题想请教您
    很多资料上面说回滚的以及死掉的会话的未提交事务都是由PMON进程负责清理,而启动时的死事务才由SMON清理,不知这种说法到底对不对。
    谢谢您。

    • 你好,这种说法是不正确的,如上文所述pmon会轮询后台和服务进程并找出dead process:

      pmon进程会找出dead process:
      [claim lock for dead process][lp 0x7000003c70ceff0][p 0x7000003ca63dad8.1290666][hist x9a514951]

      pmon会负责回收该进程所持有的锁和其他资源,但实际的事务回滚是由SMON完成的,如上文所述的

      Dead transaction 0x00c2.008.0000006d recovered by SMON
      SMON: parallel transaction recovery begin
      ……….
      SMON: parallel transaction recovery end

      可以通过10513事件禁止SMON做transaction recovery,如上文中的演示

  2. 您好,042中有一题是这样的:
    In the middle of a transaction, a user session was abnormally terminated but the instance is still up and the database is
    open. Which two statements are true in this scenario? (Choose two.)
    A.Event Viewer gives more details on the failure.
    B.The alert log file gives detailed information about the failure.
    C.PMON rolls back the transaction and releases the locks.
    D.SMON rolls back the transaction and releases the locks.
    E.The transaction is rolled back by the next session that refers to any of the blocks updated by the failed transaction.
    F.Data modified by the transaction up to the last commit before the abnormal termination is retained in the database.

    Answer : C, F

    所以我现在非常疑惑~谢谢您的回答~

  3. Pingback: 在Oracle中如何让SELECT查询绕过UNDO | Ask Maclean Oracle Blog

  4. Pingback: _OFFLINE_ROLLBACK_SEGMENTS隐藏参数 | Ask Maclean Oracle Blog

  5. Pingback: Crash/Instance Recovery与Media Recovery的本质区别 | Ask Maclean Oracle Blog

  6. 关于该问题 , Maclean有了新的认识:

    _cleanup_rollback_entries

    This is an integer parameter that defaults to 100. The parameter is used to specify the maximum number of undo (rollback) records to apply when performing transaction
    rollback under PMON.
    If the transaction rollback hits the limit then PMON aborts the recovery and signals SMON so that it can perform the recovery instead.
    It also looks like SMON may be influenced by this parameter too, if it is asked to do a quick pass (KTUREC_QP) recovery of a transaction.
    In the Oracle 8.0 Server Reference, the non-hidden version of this parameter is described as follows:
    CLEANUP_ROLLBACK_ENTRIES specifies the number of undo records processed at one time when rolling back a transaction. Prevents long transactions

    CLEANUP_ROLLBACK_ENTRIES

    Parameter type:
    integer

    Parameter class:
    static

    Default value:
    20

    CLEANUP_ROLLBACK_ENTRIES specifies the number of undo records processed at one time when rolling back a transaction. Prevents long transactions from freezing out shorter transactions that also need to be rolled back. Normally this parameter will not need modification.

  7. _cleanup_rollback_entries 默认值100, 单位为undo(rollback)记录的条数, 若小于100条则 pmon负责该事务的rollback。

    若事务需回滚的记录超过100条rollback record则PMON只负责清理dead session,而由SMON负责回滚dead transaction。

    所以关于 到底是 PMON还是SMON做rolls back the transaction这个工作的公案,算有一个合理的结论了,如上。