3.1 Buffer Cache and Shared Pool
3.1.1 Buffer Cache Sizing
The database buffer cache is used to keep database blocks in memory, so that we do not have to perform a disk read when we want to access the same database block again. Therefore we also get a lower latency to access a database block as well as lower CPU cost. In a RAC environment; when we do not find the database block in the mode we want, we may read it from disk or get it from remote cache through the interconnect. For example, when we want to modify a block, which has been modified by another Oracle instance in the cluster and it is still in that instance’s cache, then we get this block from that instance. It is transparently done by RAC; there is no need for application to be aware of this mechanism.
The recommendation for sizing the buffer cache is similar to the single node case, i.e., the same tuning procedures apply. Although actual memory requirements will depend on the application workload, we recommend an about 10% bigger buffer cache in a scale up situation (example: 1 node – 1000 users, 2 nodes – 2000 users). The additional 10% is needed because some blocks may be cached in more than one instance, depending on how the application accesses data. The values are heuristic, based on our experience in sizing and tuning RAC systems.
Take into consideration that the memory requirements per instance are reduced when the same user population is distributed over multiple nodes (example: 1 node – 1000 users, 2 nodes – same 1000 users ). Assuming the same user population, N number of nodes and M buffer cache for a single system, a better way of calculating is
(M / N) + ((M / N)* 0.10) [+ Me]
A buffer cache which was well sized for a single instance divided by the number of nodes plus an overhead per node (10%) plus additional buffer cache (Me) in case one node crashes and the users fail over to the surviving node. Please bear in mind that the actual memory usage depends on the actual requirements of the application. Example: M=2 GB, N=2
(2GB / 2) + ((2GB / 2)) * 0.10 [+ extra memory to compensate for
failed-over users]
= 1GB + 100M = 1.1GB
If memory is abundant in the system, an alternative approach is to start with a larger overhead, over-sizing the buffer cache initially, and then using the buffer cache advisory feature in order to optimize memory allocation. The buffer cache advisory data can be accessed by selecting from the dynamic performance view V$DB_CACHE_ADVICE. The most relevant column is ESTD_PHYSICAL_READ_FACTOR, which describes the expected savings in physical reads as a percentage of all reads. Please note that for multiple instances, the calculation unfortunately is not accurate, because the cache advisor currently only distinguishes between a disk read and a cache hit. It counts global cache transfers (buffer neither comes from disk nor from the local cache) as disk I/O and therefore overestimates the I/O volume and the effect of a bigger or smaller cache on physical disk I/O. The occurrence of global cache transfers is dependent on access concurrency rather than buffer cache size and therefore not suited in a linear model relating disk I/O to buffer cache size. For that reason, the results from the buffer cache advisory might be misleading and give only a rough estimate in percentage about the expected reduction in physical I/O’s.
For concept and tuning recommendation related content for the database buffer cache, please refer to the Oracle9i Release 2 Performance Guide and Reference.
3.1.2 Reducing inter-node communication by improving buffer cache hit ratio
Tuning the buffer cache may improve the database response time and total throughput not only because of reduction in the disk I/O, but also reduction in the interconnect traffic.
We wait for the events below when we do not have the database block in the local cache in the right mode and either if the master node of GCS resource is remote, or there is a remote Oracle instance that holds the block. If the count of these events is close to the number of physical reads, that means that we are waiting for these events because we just do not have the block on the local cache, and not because there is other Oracle instance that is holding the block. So if you see these system wait events below in the top most wait events from V$SYSTEM_EVENT (please refer to 4.2.1 Global Cache Events) or a Statspack report, and the count of these events (excluding the count of timeouts) is close to the number of physical reads, and the closer they are, the higher the chances that you get better performance by tuning the buffer cache and experience decreased wait times for these global cache events.
- global cache cr request
- global cache open x
- global cache null to x
- global cache open s
- global cache null to s
3.1.3 Reduction of CR buffers and PI buffers for increased cache hit ratio
PI (Past Image) is a buffer that was modified by an Oracle instance and is sent to another instance to be modified. Until Oracle8i, the block must be written to disk before the other instance can read it from disk. In Oracle9i RAC with Cache Fusion, the block is not written to disk but becomes a PI. CR (Consistent Read) is not specific to RAC, but there are more cases that CR buffers are created in RAC environment than single instance environment.
Running an OLTP benchmark using RAC, we have noticed that CR and PI buffers occupy a significant part of the buffer cache. For example:
SQL> select status, count(status) from v$bh group by status;
STATU COUNT(STATUS)
—– ————-
cr 39782
pi 21159
scur 60729
xcur 22693
Having too many CR and PI buffers practically reduces the size of the buffer cache, since we keep less number of more useful buffers (SCUR / XCUR) in the cache.
One easy way to reduce PI buffers on the cache is to perform incremental checkpointing by setting the FAST_START_MTTR_TARGET[1] parameter in the init.ora. Doing incremental checkpointing reduces the chance of dirty buffers requested from remote instance, and therefore, reduces the number of PI buffers on the cache.
For this particular workload, after setting FAST_START_MTTR_TARGET=60, the status of the cache changed as followed:
SQL> select status, count(status) from v$bh group by status;
STATU COUNT(STATUS)
—– ————-
cr 39448
pi 2
scur 65834
xcur 33094
Note:
This totally depends on the application workload characteristics. If it is a very heavy write intensive application, we may still see a lot of PI buffers in the cache, since DBWR can not keep up with flushing dirty buffers, and those dirty buffers become PI by being requested by other instances.
Now, there are still a huge amount of CR buffers in the cache. They can be either
- Created locally for read consistency
- Received from a remote instance
- Clean SCUR / XCUR buffers changed to CR (because remote instance requested the block in XCUR mode).
Until 9.0.1.2, there was an issue filed under bug# 1895220 where CR buffers are treated as equally important as SCUR and XCUR buffers, and SCUR buffers and XCUR buffers could be pushed out of the cache earlier than CR buffers. The fix to this bug was to put CR buffers at the cold end of the LRU. After addressing this issue (this fix is available in Oracle9i Release 2 and 9.0.1.3), the status of the buffer cache looked like follows:
SQL> select status, count(status) from v$bh group by status;
STATU COUNT(STATUS)
—– ————-
cr an>
free 3
pi 11
scur 71210
xcur 65465
3.1.4 Additional RAC Memory Requirements
The memory overhead, that RAC introduces, includes the following, main components in the layers listed below:
- Buffer cache layer (KCL)
- KCL global cache elements
- KCL bast context freelists
- Global cache service layer (GCS)
- GCS resources
- GCS shadows
- Global enqueue service layer (GES)
- GES resources
- GES enqueues
- GES message buffers
- GES resource hash table
- GES process array
RAC specific memory is allocated at the time of SGA creation, mostly in the shared pool. However, the memory does not count in the shared_pool_size as defined in the init.ora parameter file. Therefore, when migrating Oracle from single instance to RAC, you don’t need to adjust the shared_pool_size parameter to accommodate the additional memory that RAC uses; Oracle does that automatically. Note that when Oracle9i-style buffer cache parameters are used (e.g. db_cache_size), the memory for the KCL global cache lock elements will be allocated from the granule (i.e. the data block buffers), and not in the shared pool.
While the additional memory for GCS components and KCL global cache elements largely depends on the size of the buffer cache, the memory required for the GES related components and KCL bast context freelist depends on the value of certain init.ora parameters.
The exact RAC memory resource usage of the shared pool can be monitored by querying V$SGASTAT for GCS, GES and KCL related entries. Also, the current and maximum number of GCS resources / shadows (resource names gcs_resources and gcs_shadows) and GES resources / enqueues (resource names ges_ress and ges_locks) can be obtained from V$RESOURCE_LIMIT. In case we are running out of reserved memory for GCS / GES related components, Oracle will dynamically allocate memory from the free memory pool in the shared pool. However, it is recommended to carefully calculate the resources and enqueues as needed since dynamic memory allocation from the shared pool is an expensive operation.
The sections on the following pages discuss the memory allocations for each layer in detail.
Important note:
The formulas used here are very simplified in certain cases to facilitate understanding of the dependencies and to allow manual calculation of the extra memory overhead. The actual formulas include more, mostly internal parameters, hence actual vs. calculated memory requirements can deviate by about 10%. The calculations also do not include all possible components of the respective layers. All measurements were performed on HP-UX 11.0 64bit, Oracle9i Release 2.
KCL Layer
The memory overhead for the KCL layer is a linear function of the number of block buffers. Each data block needs about 288 bytes for the global cache element on a 64-bit system. The number of data blocks in memory depends on what buffer cache init.ora parameters you use:
db_cache_size (Oracle9i)
When Oracle9i-style buffer cache parameters are given, e.g. db_cache_size = 419430400 (equal to 102400 4KB buffers), then the buffer headers and the global cache elements reside in the granule, and not in the shared pool; therefore number of data block buffers is actually less than 102400, for both RAC and non-RAC. Assuming that the db_block_size is 4KB, the size of the buffer header and global cache element is about 576 bytes per data block, then the estimated number of data block buffers is
N_buf_4k = 419430400 /(4096+576) = 89775.
Thus, the memory overhead for the KCL layer is
M_KCL_4k = 288 * 419430400 /(4096+576) = 24.7M
If we use a cache for a larger db block size, e.g. db_8k_cache_size = 419430400 (51200*8KB), the memory overhead will be
M_KCL_8k = 288 * 419430400 /(8192+576) = 13.1M
The higher the block size, the less memory overhead it is. The Oracle9i buffer cache can be a group of buffers with different block size (e.g. db_<n>k_cache_size), one can use this formula to calculate the overhead for each buffer pool, and then add them up to get the overall memory overhead of the KCL layer. Note that the buffer headers always exist, and that it is not an overhead RAC introduces.
db_block_buffers (pre-Oracle9i)
If a pre-Oracle9i-style buffer cache parameter is given, e.g. db_block_buffers = 102400 and db_block_size = 4k, we need roughly
M_KCL_4k = 288 * 102400 = 28.1M
additional memory. This memory is reserved in the shared pool (the gc_* pool), but it is beyond the shared_pool_size given in the init.ora parameter file.
Other KCL memory allocations in the shared pool may include
KCL bast context freelist = <#processes> * 7056 = 3.4M
where 7056 bytes = 63 (maximum number of instances) * 112 bytes (size of a KCL bast context freelist structure on HP-UX 64bit).
GCS Layer
The memory overhead for the GCS layer is also proportional to the number of database blocks in the cache. The additional GCS related memory is reserved in the shared pool, in the GCS resources pool and the GCS shadow enqueue pool. The overhead is 128 bytes and 72 bytes respectively for GCS resources and shadows for each data block on a 64-bit Unix platform. For example, given the number of blocks as N_buf_4k = 89775, the memory overhead for GCS layer is roughly
M_GCS_RES = 128 * (89775 * 1.1) * 1.16 = 14.0M
M_GCS_SHA = 72 * (89775 * 1.1) * 1.35 = 9.2M
The number of buffers will internally be increased by 10% (factor 1.1), and this result (98752) will be used for the initial allocation for GCS resources and GCS shadows. The factors 1.16 (for both 32bit and 64bit platforms) for GCS resources and 1.35 (1.24 for 32bit platforms) for GCS shadow enqueues respectively, denote an adjustment due to internal memory management.
Note:
In Oracle9i Release 2 (9.2.0.1) and earlier, you may notice that with RAC linked in, shared pool memory would be allocated for GCS resources and shadow resources if the instance has been started in exclusive mode (cluster_database=FALSE). This unnecessary memory allocation will not occur starting with 9.2.0.2.
GES Layer
Unlike for the KCL and GCS layer, the memory overhead for the GES layer is not proportional to db_block_buffers. Instead, it mainly depends on the following init.ora parameters: processes, dml_locks, transactions, db_files, enqueue_resources, and sessions. We will just use the number of GES resources as basis for our simplified model.
GES resources
The number of the GES resources can be roughly estimated as
(processes + dml_locks + transactions + db_files +
enqueue_resources + 200) * (1 + (N-1)/N) ,
where N is the number of instances on the cluster. The enqueue_resources can be set explicitly in the init.ora file. Otherwise, the default value is typically equal to (dml_locks + db_files + 20). For example, when given the following init.ora parameters:
processes = 500
dml_locks = 5000
transactions = 800
db_files = 500
enqueue_resources = 7000
sessions = 555 (derived from <processes>)
The number of the GES resources on a 4-node cluster can be estimated as
N_GES_RES = (500+5000+800+500+7000+200)*(1+3/4) = 24500
The actual number of GES resources can be found in V$RESOURCE_LIMIT or in the alert.log:
Global Enqueue Service Resources = 25587, pool = 4
Given that the average size of a GES resource is about 248 bytes on a 64-bit Unix platform, the memory overhead of the GES resources is roughly
M_GES_RES = 248 * 25587 = 6.0M
GES enqueues
The number of the GES enqueues can be estimated as
N_GES_RES + (10 * sessions) + db_files + (2 * processes),
where N_GES_RES is the number of GES resources as calculated in the previous paragraph. Using the same values of the example above, we have
N_GES_ENQ = (24500+10*555+500+2*500) = 31550
The actual number of GES enqueues can be found in V$RESOURCE_LIMIT or in the alert.log as well.
Global Enqueue Service Enqueues = 31646
Given that the average size of a GES enqueue is about 328 bytes on a 64-bit Unix platform, the memory overhead of the GES resources is roughly
M_GES_ENQ = 328 * 31646 = 9.9M
GES message buffers
Additional shared pool memory for the GES layer, on top of GES resources and enqueues, will come from GES message buffers. The number of message buffers largely depends on the number of processes as defined in the init.ora. At startup, we allocate
- 230 + (2 * <#processes>) number of regular message buffers
- 230 + (2 * <#processes>) number of big message buffers
- MIN(<#processes>*2, 1000) number reserved message buffers
The size for various message buffers can be obtained with the following statements:
SQL> oradebug setmypid
SQL> oradebug call kjcsdmp
Important disclaimer:
Do NOT use this oradebug call in 9.0.1.2 or earlier, as this may crash the instance. See bug 1657912 and Appendix A for further reference. This issue is fixed in 9.0.1.3 and 9.2.
An excerpt from the resulting trace file shows:
reg_msg: sz 480 cur 0 max 0 ini 12230
big_msg: sz 2096 cur 0 max 0 ini 7763
rsv_msg: sz 2096 cur 0 max 0 tot 1000
Given that processes is set to 500 in our example, the additional memory requirement for GES message buffers can then be calculated as followed:
reg_msg_buf_size = (230 + (500 * 2)) * 480 = 590400 = 0.6M
big_msg_buf_size = (230 + (500 * 2)) * 2096 = 2578080 = 2.5M
rsv_msg_buf_size = MIN(500 * 2, 1000) * 2096
= 1000 * 2096 = 2096000 = 2.0M
======= ====
M_GES_buf = 5098400 = 5.1M
Other GES components
Additional memory is also allocated for
GES process array: <#processes> * 1250 = 0.6M
GES resource hash table: <#GES resources / 4) * 304 = 2.4M
where 304 bytes = average hash bucket structure size (208 bytes on Solaris 32bit), and 1250 bytes the average process structure (884 bytes on Solaris 32bit). For other platforms, this will vary. It needs to be noted that the number of GES resources here is internally rounded up to 32768 in order to calculate the memory for the GES resource hash table.
Summary of memory requirements
The additional memory requirements, based on the parameters in the examples above, are as followed:
KCL layer: 28.1M (24.7M + 3.4M; 7.0% of the buffer cache)
GCS layer: 23.2M (14.0M + 9.2M; 5.9% of the buffer cache)
GES layer: 24.0M (6.0M + 9.9M + 5.1M + 3.0M)
=====
75.3M
It should be noted that the relative memory overhead becomes smaller with bigger block sizes.
3.2 Database Layout: General Guidelines
For most efficient access to data, the following things should be taken into consideration:
- When INSERTING, UPDATING or DELETING, high concurrency on small working sets may affect response times. Concurrent write access to rows in the same block can result in increased block access times. If the local cache affinity, i.e. the retention and availability of data for local users decreases and blocks are shared frequently among instances, the probability that local and remote users contend for the same blocks grows and additional remote memory access latencies are added to the transactions. Additional latencies result when
- A locally cached block is requested by a remote instance but the release processing is deferred because there are active transactions pending in the block (“_gc_defer_time”, c.f. chapter 6 RAC related init.ora parameters)
- Redo for recent changes must be written to the log file before sending the block to another instance, hence requiring a log file I/O and a wait for the notification that the redo was written.
- When multiple blocks are scanned sequentially, as in long full table scan, or even in random multi-block I/O’s, as in index range scans, the number of blocks accessed and therefore the chance that data is not locally cached and require a read from disk or remote cache may increase. This may add additional latencies and CPU overhead to a query. On some platforms where low-overhead IPC protocols are not available, care should be taken to identify such queries and optimize their execution plans (c.f. 3.3 SQL Execution)
By laying out the database design data according to their expected data access characteristics, contention and I/O overhead can be optimized and reduced.
3.2.1 Different Block Sizes
As of Oracle9i, Oracle allows having different database block sizes in the same database. Taking advantage of this feature can make a larger difference in a RAC environment than in a single instance environment. If it is used appropriately, the number of disk I/O’s and relevant GCS operations should be reduced.
The general recommendation for determining the size of a database block is the same as single node environment. Follow the guideline documented in ‘Oracle9i Database Performance Guide and Reference Release 2, Chapter 15 I/O Configuration and Design, Choosing Data Block Size’.
New init.ora parameters allow the size of these separate caches to be configured individually, e.g.
db_2k_cache_size
db_4k_cache_size
etc.
Larger block sizes are useful when
- Long or short table scans are frequent, or objects are read-mostly. An application would have to read less data from disk or from another cache, consequently doing away with the overhead of opening a lock for a block read from disk and increasing the local cache hit rate,
- A large amount of data is inserted and there is no write concurrency, such as in data loads, or queue tables. In this case, a bigger block size allows more data to be added to a block before new space needs to be allocated or an index splits (c.f. discussion of overhead with index splits and segment space management in the sections 8.4 and 8.5). This works for data warehousing applications with multiple parallel insert streams, or OLTP applications where multiple queuing tables with instance affinity can be used and users on the same instance submit jobs to these tables which are then picked up by a job queue process.
Needless to say, the possibility that blocks accessed for modification from multiple instances may cause more contention is higher with large block sizes and this option must be carefully considered. Using large block sizes, as stated in the above examples, can be important optimizations for CPU overhead and for latencies.
Smaller block sizes should be taken into consideration
- For applications which access data in the same block frequently
However, the trade-off in this case is that more blocks need to be read from disk. Therefore, it may be an important response time optimization at the cost of higher CPU overhead for I/O, but it can reduce the probability of contention.
It may be difficult to decide which block sizes to choose, unless there are clear-cut application pattern. However, there are various approximations, such as
- Tracing a typical application with event 10046 and looking at the block access pattern for the top resource consuming statements, preferably building a histogram of block access (this relies on the occurrence of db file or global cache I/O waits and is therefore not a fully comprehensive method, but allows to identify the blocks with most local cache misses)
- Analyzing the segment statistics for particular segments with regard to their logical, physical and remote read characteristics
- Obtaining the SQL plans of queries that are frequently executed and considering their I/O characteristics (occurrence of contiguous reads, db file scattered read waits, index scans etc.).
- Determining the proximity of data block addresses for a given segment by running queries against V$BH or V$CACHE and comparing the distribution of buffers for file# and block#. For example, if there is a SQL statement that frequently accesses 2 adjacent blocks using 2KB block size, we should use a 4KB block size or more for this object, so that we read them in a single disk read.
Please note that a decision about the right block size to avoid contention is difficult, as there is no clear method by which the best size can be derived. However, in most cases the methods employed for single-instance layout decision are sufficient and no RAC specific considerations are necessary. The primary goal for a choice of block size should be defined as either to reduce physical I/O or global cache transfers (larger blocks), or to avoid busy buffers due to high concurrency (smaller blocks, fewer rows per block). Please also refer to 3.2.1 Different Block Sizes.
3.2.2 Separation of read-mostly and read write tablespaces
As an extension of the points made about different block sizes for tablespaces, it can also useful to separate objects into tablespaces based on their principal data access properties. Not only does this lend itself to consideration of different block sizes, but also facilitates ad-hoc setting of locks covering multiple data blocks.
In some cases, such as in DSS or data warehousing environments, or in mixed applications, increasing the ratio of locks to data blocks may reduce overhead for queries and should be considered if
- Access to data is read-mostly
- Cache sharing is low, i.e. not many blocks will be served from another instance (c.f. section 8.6)
It is sometimes known beforehand or can be inferred from V$SEGMENT_STATISTICS, whether a segment is read-mostly and/or shared data with other instances; in these cases, the suggested changes are optimizations directed towards reducing CPU cost.
3.2.3 Hash or Range Partitions
Hash or range partitioned tables and local indexes should be deployed mainly to reduce contention on index blocks when inserting frequently into segments with indexes that are possibly right growing. Generally speaking, contention can become statistically evident as
- Enqueue waits (mainly for TX enqueues)[2]
- Buffer busy due to global cache
- Global cache busy
- Global cache null to x
- Exchange deadlocks and leaf block splits
in a Statspack report and usually affects response times.
A right-growing index is a characteristic hot spot for OLTP applications and is due mainly to the fact that keys are generated in a more or less ordered increasing sequence. Therefore the leaf blocks will heat up. Moreover, frequent splitting of indexes may introduce further serialization points (c.f. “8.4 TX Enqueue Waits and Buffer Busy Waits on Index Branch or Leaf Blocks” for a discussion of index block splits).
Distributing the access to the leaves over multiple index partitions alleviates those hot spots. Apart from a significant reduction of contention, local cache affinity improves because leaf blocks are retained in a local cache longer and are more available for local users.
It should be noted that SQL execution might be seriously affected when partitioning tables and indexes, mainly because index range scans may need to access all index partitions, increasing logical reads. Partition elimination will only occur for statements containing equality predicates on the partition key in their WHERE clause ( like “WHERE <partition_key>=:b1”).
3.2.4 Automatic Segment Space Management
For ease of use and administration, it is a good practice to use automatic segment space management (also known as bitmap segments). This feature has been implemented in Oracle9i to replace segment space management by freelists and freelist groups. In most of the OTLP tests and benchmarks, it was shown that this feature performs as well or better for high volume, high rate insert transactions than freelist group managed segments.
It should be mentioned that for massive deletions from large objects, automatic space management might incur slightly more overhead because bitmap blocks need to be read and maintained. However, this will only affect deletions of a large number of rows.
Another feature of automatic space management is that it obsoletes the need to pre-allocate extents manually to freelist group blocks. Therefore, the feature of manually allocating extents to a segment is not available. It is not expected that extent allocation will ever have a significant impact when a combination of automatic segment space management and locally managed tablespaces is used for rapidly growing objects.
3.2.5 Binary large objects
If several processes are trying to insert into the same LOB segment, you might encounter contention. In order to reduce this contention, freelist groups can be defined for this LOB segment by specifying the FREEPOOLS parameter in the CREATE TABLE statement. This feature can be used if the database is running in manual or automatic undo mode, contrary to what is stated in the documentation
3.3 SQL Execution
The same sound SQL tuning rules that apply for single instance are still valid for a RAC environment. Additionally, there are some topics that have turned out to be important:
- Parse failures
- Auditing
- Full table scans
3.3.1 Parse failures
During benchmarks, we have seen that parse failures can limit the scalability of an application by up to 30%. Parse failures happen if the statement is syntactically not correct (for instance ORA-904) or not all the necessary privileges have been granted. Starting in Oracle9i, you can check for parse failures if you look for the row “parse count (failures)” in the view V$SYSSTAT; ideally, the value should be 0.
SELECT name, value
FROM v$sysstat
WHERE name = ’parse count (failures)’;
Sometimes it may be difficult to figure out which parts of the application are responsible for the parse failures. Two things can be done to isolate the statements:
- Check session statistics
- Set a database event
The first thing to do is to query the session statistics for parse failures:
SELECT st.sid, sn.name, st.value
FROM v$sesstat st, v$statname sn
WHERE st.statistic# = sn.statistic#
AND sn.name = ’parse count (failures)’
AND st.value > 0
ORDER BY 3 ;
Next, trace these sessions by setting the following event in the session to check for statements that are getting the ORA-904 error (invalid column name):
oradebug setospid
oradebug event 904 trace name errorstack 1
oradebug tracefile_name
3.3.2 Auditing
In a RAC environment, enabling Auditing has the negative side effect that this creates a lot more shared library cache locks. This already happens if auditing is turned on by the init.ora parameter audit_trail. The amount of additional shared library cache locks that has to be considered is dependent on the application and which auditing options are turned on. The internal Oracle Applications implementation enabled Auditing for CONNECT and specific tables; this increased the number of shared library cache locks by 50%. Other applications may show different results. It may not be necessary to audit any statements or privileges.
The advice is to not enable Auditing in a production environment.
3.3.3 Logon/Logoff triggers
Starting with Oracle8i, logon and logoff triggers can be defined for a database. Using logon/logoff triggers and using database auditing at the same time, can have a negative effect on database performance; in particular, the number of library cache lock acquisitions will increase.
3.3.4 Full Table scans
Full table scans in a RAC environment are even more expensive since interconnect traffic is involved. We are not saying that full table scans are bad, but care should be taken to only use them where they are of benefit. Otherwise, the Global Cache Service will have to service a lot of block requests that are artificially increasing the workload.
To check if Full Table Scans are used for large tables, run the following query:
SELECT name, value
FROM v$sysstat
WHERE name = ’table scans (long tables)’;
If there are Full Table Scans for large tables, then the next step is to check if the execution plans are reasonable. Starting from Oracle9i on, the execution plans are cached, and can be accessed through the data dictionary view V$SQL_PLAN.
SELECT object_owner, object_name, bytes
FROM v$sql_plan
WHERE operation = ’TABLE ACCESS’
AND options = ’FULL’
AND object_owner NOT IN (’SYS’, ’SYSTEM’);
Another way to identify SQL statements which are expensive in terms of CPU and memory usage is the view V$SESSION_LONGOPS.
SELECT sl.sid, sl.serial#, sl.opname,
sl.sofar, sl.totalwork, sl.time_remaining,
sl.username, sl.sql_hash_value
FROM v$session_longops sl
WHERE sl.totalwork > 50000
AND sl.time_remaining > 0
ORDER BY sl.totalwork desc;
This view collects information about table scans, hash joins, advanced queuing and recovery manager operations. The advantage of the view is that the information does not get removed from the view immediately after the query or operation has finished. Additionally, the username, the hash value for the SQL statement and the amount of work that has been done can also be investigated from this view.
3.4 Sequences
3.4.1 Sequence properties and characteristics
Sequences are used quite frequently in many applications in order to create sequential numbers. Sequences have two performance related properties: caching and ordering. Each of those properties can basically be enabled / disabled, and by default, sequences are not ordered, and have a cache size of 20. The general recommendation is to always use sequences with the CACHE clause and a reasonably large cache size, for both NOORDER and ORDER sequences, unless gapless sequences are required.
These are the available sequence options:
- CACHE
- CACHE / NOORDER
This setting has the least performance impact on RAC, and it is the default when creating a new sequence and you do not specify NOCACHE or ORDER. Each instances caches a distinct set of numbers. Sequences will not be ordered, and gaps will occur when instances shut down.
- CACHE / ORDER
Each instances caches the same set of numbers. Ordering of sequence numbers is guaranteed, and gaps will occur. Performance is better than with NOCACHE sequences. The synchronization / communication of the caching state is done via an SV instance lock on the resource i.e. sequence. So when LCK0 receives the lock grant, it can forward the lock value (i.e. new sequence number) to the foreground process (requestor).
For sequences that are frequently used, the cache size should be increased up from its default to a value that depends on how frequently you access the sequence cache. If you have not cached your sequences or your cache size is too small, you will see contention for the SQ enqueue in the V$ENQUEUE_STAT data dictionary view.
SELECT eq_type, cum_wait_time
FROM v$enqueue_stat
ORDER BY 2;
If you expect to have so-called “login storms”, i.e. many users log in at the same time, it is a good idea to increase the cache value of the sequence SYS.AUDSES$:
ALTER SEQUENCE sys.audses$ cache 10000;
- NOCACHE
- NOCACHE / NOORDER
Use this setting when e.g. government regulations or laws legally require sequence numbers without gaps. Ordering is not guaranteed. Performance over NOCACHE / ORDER is increased.
- NOCACHE / ORDER
Use these settings when gapless and ordered sequences are required; no gaps will occur and ordering is guaranteed. This setting though has the biggest performance impact on RAC.
A side effect of ordered and / or non-cached sequences in RAC can be a high wait time for ‘row cache lock’ waits for the DC_SEQUENCES row cache. The number of GET requests, modifications, GES requests and GES conflicts may correlate to the number of executions for a particular SQL statement that obtains a new sequence number.
Excerpt from a Statspack report:
CPU per Elap per
Executions Rows Processed Rows per Exec Exec (s) Exec (s) Hash Value
———— ————— —————- ———– ———- ———-
138,400 138,400 1.0 0.01 7.36 704652381
select psindex_seq$.nextval from dual
…
…
Get Pct Scan Pct Mod Final
Cache Requests Miss Reqs Miss Reqs Usage
————————- ———— —— ——- —– ——– ———-
dc_sequences 138,400 99.7 0 138,400 2
————————————————————-
GES GES GES
Cache Requests Conflicts Releases
————————- ———— ———— ————
dc_sequences 138,645 137,917 0
————————————————————-
3.4.2 How sequence caching can reduce index leaf block contention
Indexes with key values generated by sequences tend to be subject to leaf block contention when the insert rate is high. That is because the index leaf block holding the highest key value is changed for every row inserted, as the values are monotonically ascending. In RAC, this may lead to a high rate of current and CR blocks being transferred between nodes, and the performance degradation symptoms may include those described in:
- 8.1 CR requests and Buffer waits on remote undo segment headers
- 8.4 TX Enqueue Waits and Buffer Busy Waits on Index Branch or Leaf Blocks
Reducing inter-instance concurrent access to index leaf blocks can alleviate these performance problems. One of the simplest techniques that can be used for that purpose is to increase the sequence cache.Note that this potential performance improvement can only be obtained with sequences configured as CACHE NOORDER.
For example, let’s suppose an index key value is generated by a CACHE NOORDER sequence, and each index leaf block can hold 500 rows:
- If the sequence cache is set to the default (20): while instance 1 inserts values like 1, 2, 3, etc., instance 2 concurrently inserts 20, 21, 22, etc. Since the difference between the values is a lot smaller than the number of rows in the block, the two instances will keep changing the same index block as sequence values increase.
- If the sequence cache is set to 50000: while instance 1 inserts values 1, 2, 3, etc., instance 2 concurrently inserts 50001, 50002, etc. In the beginning, both instances will be writing to the same leaf block, but after some block splits this trend will change, as the difference is now equivalent to about 100 leaf blocks.
So what is the ideal value to set sequence cache to avoid inter-instance leaf index block contention, yet minimizing possible gaps? One of the main variables to consider is the insert rate; the higher it is, the higher the sequence cache must be. However, other factors may also play a role, especially for workloads composed by a mix of DML and query activity, so results may vary widely for different systems. We have performed some tests involving only insert statements executed concurrently from 2 instances; some of the results are shown in the graph below:
Figure 1: Impact of larger sequence cache on throughput
The test compares of 4 processes concurrently inserting into the same table on a single instance against 4 processes running on 2 instances (2 processes per instance). Note that varying the sequence cache has hardly any impact on the single instance case, but has a major influence on throughput for RAC. Moreover, throughput on RAC was better than in single instance for sequence caches above 10000 due to the fact that local latch contention was limited by having 2 buffer caches (2 instances), while the sequence cache minimized inter-instance contention.
Although these results illustrate the potential benefits of increasing sequence caches, it should be mentioned that some workloads may not benefit as much from this technique, and the only way to determine if a particular application can benefit from it is through testing. As changing sequence caches have no impact on availability, it is relatively easy to experiment with different values.
3.5 PQ/PX
The Parallel Query (PQ) option was first introduced in the release of Oracle 7.1, which provides the capability of parallel full table scan. During the development cycle of Oracle8i and Oracle9i, Oracle extended and enhanced the Parallel Execution (PX) to include parallel DML on partitioned tables, parallel index creation, parallel recovery, parallel statistics collection for cost-based optimizer, etc. The parallel execution scales very well on a single large SMP node, and the Real Application Cluster (RAC) extends the scalability to multiple instances cross multiple nodes (inter-node parallel execution).
The Oracle9i Data Warehousing Guide, Chapter 21, has detailed coverage on how to use and tune parallel executions for Oracle9i. Most of the concepts and the techniques apply to both the single instance and multiple instances.
3.5.1 The Degree of Parallelism
The Degree Of Parallelism (DOP) used for a SQL statement is determined based on certain criteria, such as the type of statement (query, DML, DDL) and the objects involved (tables, indexes). For queries, it is usually the maximum DOP attribute of any of the tables involved. The maximum degree is used to scan all tables/indexes, e.g. if table A is degree 4 and table B is degree 6 and the query joins both, degree 6 will be used to scan both A and B. The number of parallel execution servers used for a statement of a given DOP depends on the operation. We have a two slave set model, so anything more complicated than a simple SELECT (e.g. JOIN) will require DOP*2 parallel execution servers.
3.5.2 Automatic tuning of parallel execution
Oracle recommends that you allow instances themselves to determine how they will distribute the workload for parallel operations across the available parallel server processes by setting the initialization parameter PARALLEL_AUTOMATIC_TUNING to TRUE. This will automatically adjust the parameters such as LARGE_POOL_SIZE, PARALLEL_MAX_SERVERS, PARALLEL_ADAPTIVE_MULTI_USER, etc. The adaptive multi-user feature adjusts the degree of parallelism based on user load. In a RAC environment, local parallel execution servers are given priority for parallel execution to reduce inter-node communications. However, if local resources are not enough on the local node, parallel execution servers from remote nodes will be used. In general, you can assume that Oracle will try to allocate “CPU_COUNT” parallel execution servers on the local node before requesting parallel execution servers from other nodes.
3.5.3 Inter-node parallel execution
Parallel Query
By default, all available RAC instances are considered for parallel execution. So please be aware of the potential case of running out of available parallel execution servers on any given instance. In case of an insufficient number of parallel execution servers, the respective SQL statement will be processed using a less number of parallel execution servers, or even processed serially.
Example:
- 2-node cluster
- parallel_max_servers = 4
- Instance A, and then instance B shortly after, submit the following query:
select /*+ PARALLEL(mytable,8) */ count(*) from mytable;
Here, the query on instance A will spawn 4 parallel execution servers on instance A, and 4 parallel execution servers on instance B. Since parallel_max_servers is set to 4, this leaves no available parallel execution server for the query submitted on instance B, and it will be processed serially. The query on instance A will finish quicker than the query on instance B. Note that parallel execution will cause the data being read in from disk via direct path directly into the PGA thus bypassing the buffer cache, whereas for serial execution, we read data into the buffer cache.
Parallel DML
Parallel DML works in the same manner. In the example below, 12 parallel execution servers were spawned on 3 instances for this UPDATE statement:
SQL> alter session enable parallel dml;
Session altered.
SQL> alter session set “_px_trace”=”all”;
Session altered.
SQL> update /*+ parallel(ware,12) */ ware set w_ytd=w_ytd;
From the trace file generated via _px_trace:
kxfpgsg
got 12 threads on 3 instance(s), avg height=4
Insts 1 2 3
Svrs 4 4 4
For more detail on allocation of parallel execution servers, see Appendix I: Allocation of parallel execution servers in RAC.
3.5.4 Using Parallel Instance Groups
If you want to limit the parallel execution within one instance or a group of instances, you should establish the group membership by setting the INSTANCE_GROUPS and the PARALLEL_INSTANCE_GROUP init.ora parameters. For example, you assign
- On instance A: INSTANCE_GROUPS=AMER
- On instance B: INSTANCE_GROUPS=AMER, EMEA, APAC, JPN
then, a user can activate the nodes in the AMER group to spawn query server processes by
ALTER SESSION SET PARALLEL_INSTANCE_GROUP = ‘AMER’;
In response, the parallel executions can be split across instances, namely can run on both instance A and instance B. On the other hand, if you set the PARALLEL_INSTANCE_GROUP = ‘APAC’, only instance B can be used for parallel executions.
The init.ora parameter INSTANCE_GROUPS cannot be changed dynamically.
The pair of parameters is useful when you partition your workload by function among different nodes.
3.5.5 Debugging and Tuning Parallel Execution
Setting the “_px_trace” parameter at session level can turn on the parallel execution tracing. For example, the following command enables time tracing for messages between parallel execution servers.
ALTER SESSION SET “_PX_TRACE” = “MEDIUM”, “MESSAGING”, “TIME”;
The options for verbosity level are “low”, “medium”, and “high”. The layer can be one or more of options in “scheduling”, “execution”, “granule”, “messaging”, “buffer”, “compilation”, “tq”, and “affinity”. If timing is desired, one can enable it by setting the “time” option. The following excerpt from the Oracle trace file shows the trace of a round trip message between two parallel execution servers on two different nodes. The timing unit is millisecond.
Example for a trace output:
kxfprialo [ 2941/ 0]
new qref=0x8099bd40 q=0x8099c138 qd=0x2b38c3c state=10000 msg=0×0
kxfpiqrcv [ 2942/ 1]
successfully posted a receive of buffer 809759f8, size 2148…
kxfprialo [ 2943/ 1]
new qref=0x8099bb80 q=0x8099c138 qd=0x2b38ca0 state=10000 msg=0×0
kxfpqrienq [ 2943/ 0]
sending message to server 0 on instance 2 – serial# 1537, sequence# 1
kxfpg1sg [ 2943/ 0]
wait reply from qref 0x8099bd40
kxfp_send_callback [ 2945/ 2]
buffer 80975194, qref 8099bb80
kxfpqiprq [ 2950/ 5]
got a regular message…
kxfpqrcv [ 2950/ 0]
received message from server 0 on instance 2 – serial# 1537, sequence# 2
msg=0x809759f8
It can oco a slave; consider increasing the init.ora parameter PARALLEL_EXECUTION_MESSAGE_SIZE to a higher value, e.g. 4k. On most platforms, the default value is 2048 bytes if PARALLEL_AUTOMATIC_TUNING is set to FALSE (default in Oracle9i), and 4096 bytes if set to TRUE. The default value is adequate for most applications. Larger values require a larger shared pool, but result in better performance at the cost of higher memory usage.
Note:
When running HMP protocol on HPUX, the message size is limited to 4k. When attempting to use a higher value than 4096 for PARALLEL_EXECUTION_MESSAGE_SIZE, the alert.log will show:
LICENSE_MAX_USERS = 0
SYS auditing is disabled
kxfpnfy: Adjusting parallel execution message size to 4096 bytes to conform to IPC OSD message size limit
Starting up ORACLE RDBMS Version: 9.2.0.1.0.
Please also refer to Appendix A: Critical Bugs List with regards to PQ related bugs in RAC.
3.6 Advanced Queuing / Replication
This section gives some hints on using Replication and Advanced Queuing with RAC. The following items are discussed:
- Database Jobs
- Replication
- Advanced Queuing
3.6.1 Database Jobs (DBMS_JOB)
The database job facility is frequently used in Replication environments, and is configured through the DBMS_JOB package.
If tables are accessed on only one of the nodes, it makes sense to use the instance affinity feature of the DBMS_JOB package. With this feature it is guaranteed, that only the specified instance will run the job. The instance affinity feature is configured with the INSTANCE parameter. The following procedures make use of this parameter:
- DBMS_JOB.SUBMIT
- DBMS_JOB.CHANGE
- DBMS_JOB.INSTANCE
- DBMS_JOB.USER_EXPORT
The instance assignment can be changed with the DBMS_JOB.INSTANCE procedure afterwards. The assignment to instances is not only interesting for Replication or Advanced Queuing, but also for other users of the database job facility, if most of the work is done on a particular instance.
It is worth caching the sequence that is used by the job queue facility.
ALTER SEQUENCE sys.jobseq$ CACHE 100;
3.6.2 Replication
In the Replication module, we differentiate between three different ways to replicate the data: read only snapshots, updateable snapshots and master-master replication.
Read Only Replication
This part of the discussion focuses on read only snapshots. Most of it can also be applied for updateable snapshots. The following points are discussed:
- Automatic segment space management
- Instance affinity for DBMS_JOB
If Read-Only Snapshots are used, it is important to use locally managed tablespaces. In Oracle9i, tablespaces are automatically created as locally extent managed. We recommend these tablespaces to be set up with automatic segment space management, because by creating the tables and the snapshot logs as bitmap segments, will eliminate the need to create freelists and freelist groups.
CREATE TABLESPACE <tablespace name>
DATAFILE <file spec>
EXTENT MANAGEMENT LOCAL
SEGMENT SPACE MANAGEMENT AUTO
It is advisable to create the snapshot log in a dedicated tablespace:
CREATE SNAPSHOT LOG ON <schema>.<table name>
TABLESPACE <tablespace name>
Multi-Master replication
Before multi-master replication can be used, the script CATREP.SQL has to be run. This script creates several tables and procedures in the SYS schema and system tablespace by default. Performance will improve if these objects are created in their own tablespace with automatic segment space management enabled. If the objects are already created, you can follow the move procedure that is described in WebIV Note# 1037317.6.
It is important that the tables that are part of a replication group are created in a bitmap segment tablespace. Additionally, cache the sequences SYS.IDGEN1$ and SYSTEM.REPCAT_LOG_SEQUENCE with a higher value.
While using Advanced Replication the following tables need to be kept in the shared pool, using the DBMS_SHARED_POOL.KEEP() procedure.
- DBMS_REPCAT
- DBMS_REPCAT_RPC
- DBMS_DEFER_SYS
- DBMS_JOB
- <TableName>$RP
3.6.3 Advanced Queuing
The Advanced Queuing Option and Replication are very similar. Part of the recommendations that were given for replication, also apply for Advanced Queuing:
- Automatic Space management
- Instance Affinity
Automatic segment space management should be implemented for the tablespaces where the queue tables will be created. An example of how to create these tablespaces is shown in the section on Replication.
With Advanced Queuing, there is one scenario where it makes sense to specify on which node the queue tables are to be created. If the queue tables enqueue and dequeue messages on the same instance, the instance should be specified while these tables are created.
Advanced Queuing is configured and administered through PL/SQL packages. The main package is DBMS_AQADM; it offers procedures where it is possible to assign the work to a particular instance:
- DBMS_AQADM.CREATE_QUEUE_TABLE
- DBMS_AQADM.ALTER_QUEUE_TABLE
3.7 Recovery
This chapter
- Briefly lists the recovery steps in RAC,
- Illustrates the dependency / independency of the number of dirty blocks and number of available and failed instances on instance recovery time,
- Explains how recovery times can be tuned.
3.7.1 Instance recovery steps
Instance recovery encompasses the following steps:
- SKGXN (Node Monitor) synchronization
- Cluster Group Service Membership / LMON reconfiguration
- Global Resource Directory reconfiguration
- First pass log read
- Recovery Lock Claiming, and Recovery Reads
- Second Pass Log Read and Redo Application
The overall recovery time is dominated, even in a single-node environment, by the time spent in steps 5 and 6, i.e. by the time to read the recovery blocks from disk, the time to claim recovery locks for each of those blocks (instance recovery only), and by the time to apply the redo.
3.7.2 Instance recovery cost
In general, recovery cost is proportional to the number of blocks needing recovery, and not the number of available or failed instances. Let us illustrate this with some fictitious scenarios.
Scenario A:
- 2-node cluster, instance 1 has N dirty buffers, and crashes. Instance 2 needs to recover N blocks.
- 4-node cluster, instances 1,2,3 together have a total of N distinct dirty buffers. Instances 1,2,3 crash, and instance 4 needs to recover N blocks.
In both cases, there are N dirty buffers that need recovery, so you can expect approximately the same time for instance recovery.
Scenario B:
- 2-node cluster, instance 1 has N dirty buffers, and crashes. Instance 2 needs to recover N blocks.
- 4-node cluster, instances 1,2,3 have a total of N distinct dirty buffers. Only instance 1 crashes; instance 2 will commence recovery for N/3 blocks.
Here you can expect that recovery is faster in the 4-node case since there are fewer blocks to recover.
Scenario C:
- 2-node cluster, instance 1 has N dirty buffers, and crashes. Instance 2 needs to recover N blocks.
- 4-node cluster, instances 1,2,3 EACH has N distinct dirty buffers, so total number of dirty buffers is 3*N. Instances 1,2,3 crase 4 needs to recover 3*N blocks.Since recovery reads and redo application are the biggest contributors to recovery, you can expect that these two steps take roughly three times longer in this 4-node case.
3.7.3 Recovery tuning
Guaranteeing that the cache remains fairly clean can therefore drastically reduce recovery times. This can be proactively achieved by employing incremental checkpointing, a mechanism that frequently writes dirty buffers out to disk:
- Set the init.ora parameter FAST_START_MTTR_TARGET to a non-zero value in seconds
This will guarantee that instance or crash recovery will complete within this desired time frame. The default setting for FAST_START_MTTR_TARGET is 0, i.e. incremental checkpointing is disabled. When a new database is created via DBCA, Oracle will automatically add FAST_START_MTTR_TARGET with a value of 300 (i.e. 5 minutes) to the initialization parameter file.
In addition, the following mechanisms will proactively and reactively help in reducing recovery times even further:
- Use of Async I/O
This is one of the most crucial factors in recovery timing. Even with larger and fairly dirty caches, recovery reads can complete rather quickly since recovery reads are always attempted in asynchronous mode if Async I/O is supported on your platform, and if the I/O subsystem can handle the amount of I/O. In a recent recovery situation, the synchronous recovery reads of 180000 blocks took around 20 minutes; with asynchronous I/O, this could have completed in less than 1 minute. The first-pass log read also uses async read-ahead.
Important note:
It needs to be mentioned that the performance of the claiming phase may degrade for a larger number of asynchronous recovery reads due to the way we currently handle I/O request completions in the port-specific Oracle I/O layer on HPUX and AIX. This issue is being tracked as bugs 2505200 and 2515238; please also refer to Appendix A: Critical Bugs List. For Tru64, please increase the kernel parameter aio_task_max_num (see 9.3.4 OS configuration, Other OS parameters). This problem can be avoided by guaranteeing that the buffer cache remains fairly clean, and crash or instance recovery won’t have to handle a larger recovery set.
- Use of parallel execution servers by setting the init.ora parameter recovery_parallelism > 0
Parallel execution servers will perform steps 5 and 6 in parallel. This is in particular useful if data block I/O or CPU is the bottleneck.
If recovery_parallelism is set to 0 (default), then SMON will perform these steps when doing instance recovery; in crash recovery, the foreground process will perform these steps.
- Instance recovery only uses 50% of the available default buffer cache (where the default buffer cache = buffer cache – RECYCLE pool)[3] for recovery buffers. If instance recovery does not find enough space for all the recovery buffers, i.e. the available buffer cache is too small, then steps 5 and 6 will be performed in several passes. This should be avoided since the blackout period does not end until the last subset of buffers requiring recovery is read into the cache. When this spillover situation occurs, it will be recorded in the alert.log and SMON trace file of the recovering instance.
Excerpt from alert.log:
Fri May 31 09:29:19 2002
Started first pass scan
Fri May 31 09:29:19 2002
Completed first pass scan
2958 redo blocks read, 8718 data blocks need recovery
Fri May 31 09:29:21 2002
Started recovery at
Thread 1: logseq 12, block 3, scn 0.0
Recovery of Online Redo Log: Thread 1 Group 2 Seq 12 Reading mem 0
Mem# 0 errs 0: /dev/rvol/oracle_dg/pommi_log1_2
Fri May 31 09:29:22 2002
Recovery set larger than buffer cache: start next pass!
Recovery of Online Redo Log: Thread 1 Group 2 Seq 12 Reading mem 0
Mem# 0 errs 0: /dev/rvol/oracle_dg/pommi_log1_2
Fri May 31 09:29:25 2002
Ended recovery at
Thread 1: logseq 12, block 2961, scn 0.5075528
8693 data blocks read, 8718 data blocks written, 2958 redo blocks read
Ending instance recovery of 1 threads
Excerpt from SMON trace file:
*** 2002-05-31 09:29:21.685
KCRA: buffers claimed = 4770/8693, eliminated = 25
KCRA: spillover occurred
*** 2002-05-31 09:29:22.664
KCRA: start recovery claims for 3923 data blocks
*** 2002-05-31 09:29:24.003
KCRA: buffers claimed = 3923/3923, eliminated = 0
As it can be seen in this example, recovery needed two passes to complete; first to recover 4770 blocks, and in the second and last pass, the remaining 3923 blocks.
The above example was performed in a test environment; in a production system with a larger number of blocks to recover and more redo logs to read, recovery times may increase drastically if several recovery passes have to be performed.
Using incremental checkpointing can therefore also reduce the likelihood of this spillover situation, i.e. setting FAST_START_MTTR_TARGET to a non-zero value. Or if memory is available, increase the size of the default buffer pool.
3.8 PL/SQL native compilation
PL/SQL native compilation is a new Oracle9i feature. By enabling this feature, PL/SQL objects are no longer saved in the data dictionary; instead, when you compile your packages, a shared library is created in the local filesystem. The newly created library cannot be copied to another node’s local filesystem since Oracle validates the file by looking at the file header; so when trying to access the package from another instance, we will find the shared library not in sync with what we expect. For this reason, you have to use either a cluster file system or use NFS. However, since NFS is not considered very reliable, it is recommended to use this feature in a RAC environment only if a cluster file system is available. And only if the number of packages using PL/SQL native compilation is relatively small and does not undergo frequent changes, it makes sense to use this feature: in e.g. an Oracle Applications environment with 40000+ packages, and where base object may change frequently due to patch application, this is just no feasible due to e.g. filesystem limits, down-time limits, unless the number of eligible objects can be reduced to a reasonable value.
Using PL/SQL native compilation has a negative effect on compilation times: it doubles. This increase in compilation time occurs in both single instance and RAC environments.
3.9 Statspack
It is essential to have baseline performance data available. We recommend the usage of Statspack on all the nodes in a RAC environment. In order to use Statspack, it is necessary to set up a new user (PERFSTAT), by running SPCREATE.SQL. It is important that the default tablespace for PERFSTAT is a locally managed tablespace with automatic segment space management; otherwise you may see a lot of pinging as no freelist groups are used by default.
The snapshots should be taken at the same time. Otherwise it may not be possible to get a complete performance picture of the whole cluster.
With Oracle9i Release 2, object level statistics are available with a snapshot level => 7. Please also refer to 4.5 V$SEGMENT_STATISTICS.
Care should be taken when examining the wait events presented in Statspack. There are wait events that do not contribute to the response time perceived by users; Statspack keeps track of those in ‘STATS$IDLE_EVENT’ so that they are automatically eliminated from the reports. In Oracle9i Release 1, the RAC idle event[4] “ges remote message” needs to be inserted into PERFSTAT.STATS$IDLE_EVENT, so that it is not listed in Statspack reports. There are other non-RAC idle events that may need similar treatment, like “PL/SQL lock timer”, for example.
It is also advised to frequently analyze a fast growing PERFSTAT schema in order to ensure proper execution plans for Statspack queries during snapshots.
3.10 SCN propagation
To implement read consistency functionality in a RAC system, SCNs (System Change Numbers) need to be propagated throughout the cluster. The two main algorithms utilized to propagate SCNs are:
- Broadcast-On-Commit:
At every commit, the most recent local SCN will be broadcasted. Read consistency is fully guaranteed for any kind of workload, since every change in the local SCN is immediately propagated.
- Lamport:
The SCN is sent along with every GCS message.
The Lamport implementation is more efficient because no additional messages need to be sent. Even when GCS messages are very infrequent, SCN’s will be propagated by LCK0 every 3 seconds. In very rare cases, for some applications, Lamport may not be fast enough, and recently committed data may not be visible yet on other instances. That can happen, for example, when there is a pool of connections spanning multiple instances, so that a client process could potentially commit a transaction on instance 1 and immediately query the changed data on instance 2. In those cases, Broadcast-on-commit will be a better choice for SCN propagation. Applications where a client process always remains connected to one instance (such as Oracle Applications) will have no problems running with Lamport.
The performance overhead associated with Broadcast-On-Commit may vary for different workloads and different platforms. It depends on the frequency of commits, the number of nodes in the cluster, and the interconnect protocol. Broadcast-On-Commit was used in a 4-node SAP benchmark on HP Tru64 (RDG IPC protocol) with excellent scalability results. On the other hand, Lamport outperformed Broadcast-on-Commit by up to 20% in a smaller scale benchmark on Sun using UDP interconnect protocol.
SCN propagation algorithms are controlled by the parameter MAX_COMMIT_PROPAGATION_DELAY.
3.10.1 MAX_COMMIT_PROPAGATION_DELAY
The parameter MAX_COMMIT_PROPAGATION_DELAY (MCPD) is a number in hundredths of a second that used to specify the maximum delay between SCN propagations. However, as described above, LCK propagates the SCN every 3 seconds, as of Oracle 7.3.4. Because of that, the parameter lost its meaning as an upper-bound amount of time. The following are the most useful settings for MCPD:
- MCPD < 100 – Broadcast-On-Commit:
-
- Any value below 1 second will cause the Broadcast-On-Commit SCN scheme to be picked.
-
- On HP Tru64, MCPD=0 triggers a different, sub-optimal implementation of Broadcastcessive CPU consumption; this value happens to be the default on Tru64, hence should be changed from its default for better performance.
-
- We recommend any value between 1 and 99 to be used when the application requires Broadcast-On-Commit.
- MCPD >= 700 – Lamport:
-
- Setting MCPD to 700 (7 seconds) or above will cause the Lamport SCN scheme to be picked.
-
- 700 is the default value on all platforms except Tru64, where the default is 0.
-
- 700 or above is the recommended setting for most applications.
- Values between 100 and 699 are not recommended.
3.10.2 Messages in the alert.log
The SCN generation scheme is seen in the alert.log just before opening the database, for example:
- “Picked Lamport scheme to generate SCNs”, or
- “Picked broadcast on commit scheme to generate SCNs”
On Tru64, “Picked Lamport Server scheme to generate SCNs” is reported for MCPD < 700; this is actually a misnomer since it is not a Lamport implementation, and Broadcast-On-Commit is used.
The alert.log also shows the message “SCN scheme {1,2,3}” during startup. This scheme is not related to MCPD setting and/or SCN propagation in RAC, but rather refers to platform dependent SCN generation algorithms even for single instance. The values {1,2,3} denote the following:
- Latch-free SCN writes with 64-bit Compare-And-Swap (CAS) (e.g. HPUX 11i)
- Latch-free SCN writes with 32-bit CAS (e.g. Solaris 32bit)
- Latch-free SCN reads, if CAS not supported or latch-free SCN writes with 32/64 bit CAS disabled (e.g. Tru64)
[1] See chapter 6.3.2 for a description of the init.ora parameter FAST_START_MTTR_TARTGET
[2] TX enqueue contention is normally a sign of application contention, unless caused by ITL waits (not enough space in the block) or service ITL waits (index block splits). The TX enqueue hold times are affected by increased buffer access times. Therefore, buffer busy (global cache) waits and enqueue waits (TX) frequently occur as dependent pair in the top 5 wait events.
[3] By applying the backport for bug 2563162, the KEEP pool can also be used for recovery.
[4] Other RAC idle events are ‘gcs remote message’ (Oracle9iR2) and ‘gcs for action’ (Oracle9iR1).
© 2010, www.oracledatabase12g.com. 版权所有.文章允许转载,但必须以链接方式注明源地址,否则追究法律责任.
相关文章 | Related posts:
- Collecting Diagnostic information for Oracle block corruption
- TROUBLESHOOTING: Tuning Queries That Cannot be Modified
- 10.2.0.5 补丁集发布
- Oracle Block Cache Layer And Its Corresponding Corruption Error Messages.
- ORA-00600: INTERNAL ERROR CODE, ARGUMENTS: [729], [10992], [SPACE LEAK] Example
- ORA-600 internal error[kqrfrpo]一例
- TECH: Database Block Checking Features
- 参数cluster_interconnect详细介绍
- Patch set update 11.2.0.1.1 Readme quote
- SQL*Net PERFORMANCE TUNING UTILIZING UNDERLYING NETWORK PROTOCOL




最新评论