Applies to:
Oracle Server – Enterprise Edition – Version: 8.0.3.0 to 11.2.0.2 – Release: 8.0.3 to 11.2
Information in this document applies to any platform.
Oracle Server Enterprise Edition – Version: 8.0.3.0 to 11.2.0.1.0
Purpose
To provide some practical hints on how to deal with the effects of moving to an AL32UTF8 database character set and using Unicode clients.
Scope and Application
This note ignores any difference between AL32UTF8 and UTF8 and uses AL32UTF8, all information in this note is however the same for UTF8.
Choosing between UTF8 and AL32UTF8 (in 9i and up) is discussed here: Note 237593.1 Problems connecting to AL32UTF8 databases from older versions (8i and lower)
Basically, if your setup is completely (!) (all clients and all servers) 9i or higher, use AL32UTF8 as NLS_CHARACTERSET (unless there are restrictions posed by the Application layer/vendor like for example Oracle Applications lower then Version 12).
If there are older 8i or lower clients use UTF8 and not AL32UTF8 as NLS_CHARACTERSET.
It will provoke data corruption unless you applied Patch 5874989 on the Impdp side, Expdp is not affected. The “old” exp/imp tools are not affected. This problem is fixed in the 10.2.0.4 and 11.1.0.7 patch set.
For windows the fix is included in
10.1.0.5.0 Patch 20 (10.1.0.5.20P) or later, see Note 276548.1
10.2.0.3.0 Patch 11 (10.2.0.3.11P) or later, see Note 342443.1
AL32UTF8 / UTF8 (Unicode) Database Character Set Implications
A) Often asked questions:
A.1) Do I need to use Nchar, Nvarchar2 or Nclob?
People often think that data types like NCHAR, NVARCHAR2 or NCLOB (NLS_NCHAR_CHARACTERSET / National Character set datatypes) need to be used to have UNICODE support in Oracle.
This is simply not true.
The NLS_NCHAR_CHARACTERSET (used for NCHAR, NVARCHAR2 and NCLOB columns) is in 9i and up always Unicode (see Note 276914.1 The National Character Set in Oracle 9i and 10g ) but “normal” CHAR, VARCHAR2 , LONG and CLOB columns can be used for storing Unicode. In that case an AL32UTF8 NLS_CHARACTERSET database is needed.
SQL> select value from NLS_DATABASE_PARAMETERS where parameter='NLS_CHARACTERSET';
VALUE
----------------------------------------
AL32UTF8
And then all “normal” CHAR, VARCHAR2 , LONG and CLOB datatypes are “Unicode”.
A.2) Does my Operating System need to support Unicode?
For an Unicode database Oracle does not need “Unicode support” from the OS where the database is running on because the Oracle AL32UTF8 implementation is not depending on OS features.
It’s for example perfectly possible to run/use an AL32UTF8 database on an Unix system that has not installed any UTF-8 locale. It’s however advisable to configure you OS to use UTF-8 so that you can use this environmet as UTF-8 *client*.
There is also no need to “install Unicode” or so for the Oracle database software, all chartersets known in a database version , and this includes Unicode charactersets, are always installed. You simply cannot choose to not install them.
A.3) What are the Unicode charactersets and the Unicode versions in Oracle RDBMS?
For information on the Unicode character sets in Oracle and the versions of Unicode supported please see: Note 260893.1 Unicode character sets in the Oracle database
A.4) Is -insert your preferred language here- supported in an Oracle Unicode database?
The short answer, when using AL32UTF8, is “yes”.
For some languages like HKCSC2004 UTF8 may not be ideal. If you want to be 100% sure check the Unicode version of the Oracle release and then have a look at http://www.unicode.org or Note 1051824.6 What languages are supported in an Unicode (UTF8/AL32UTF8) database?
Most likely it’s a bigger question if the client environment can support the language in question then an AL32UTF8 database.
A.5) I also want to upgrade to a new Oracle version, do I go to AL32UTF8 before or after the upgrade?
If your current Oracle version is 8.1.7 or lower then it’s best to upgrade first to a higher release, mainly because a) you then can use AL32UTF8 (not possible in 8i) and b) Csscan has a few issues in 817 who might provoke confusion.
If your current Oracle version is 9i or up then both (before or after) are a good choice, it simply depends on your preference or needed application changes. We would however advice to not do the upgrade and the characterset change at the same time, simply to be able to trace issues who might arise to or the upgrade or the characterset change. Since doing an upgrade or characterset change NEEDS proper testing and Q&A this is of course less relevant for production systems, the changes are then already well tested.
B) Server side implications.
B.1) Storage.
AL32UTF8 is a varying width characterset, which means that the code for 1 character can be 1 , 2 , 3 or 4 bytes long. This is a big difference with character sets like WE8ISO8559P1 or WE8MSWIN1252 where 1 character is always 1 byte.
US7ASCII characters (A-Z,a-Z,0-1 and ./?,*# etc..) are in AL32UTF8 always 1 byte, so for most West European languages the impact is rather limited as only “special” characters like ç, ñ , é will use more bytes then in a 8 bit characterset. When converting a Cyrillic or Arabic system to AL32UTF8 then all the Cyrillic or Arabian data will take considerable more bytes to store.
This also means that the columns need to be big enough to store the additional bytes. By default the column size is defined in BYTES and not in CHARACTERS. By default a “create table <name> (<colname> VARCHAR2 (2000));” means that that column can store 2000 bytes.
From 9i onwards it’s possible to define the column length with the number of characters you want to store, regardless of the characterset. How this works, what the limits and current known problems are is explained in Note 144808.1 Examples and limits of BYTE and CHAR semantics usage
More info on how AL32UTF8 encoding works can be found in Note 69518.1 Storing and Checking Character Codepoints in a UTF8/AL32UTF8 (Unicode) database.
B.2) How much will my database grow when going to AL32UTF8?
The biggest expansion will be seen with CLOB’s, if the source database is a 8 bit characterset (WE8ISO8859P1, WE8MSWIN1252 etc) then populated Clob columns will double in disksize.Note 257772.1 CLOBs and NCLOBs character set storage in Oracle Release 8i, 9i and 10g
An Estimation of the expansion is listed in the Csscan .txt file output under the Expansion header. See Note 444701.1 Csscan output explained
We advice to use Csscan always when going to AL32UTF8, see point B.5).
B.3) Codepoints for characters may change in AL32UTF8.
There is a common misconception that a character is always the same code, for example the £ sign is often referred as “code 163″ character. This is not correct, a character is a certain code only in a certain characterset (!). The code itself means nothing if you do not know what characterset you are using.
The difference may look small, but it’s not.
The pound sign for example is indeed “code 163″ ( A3 in hex) in the WE8ISO8859P1 and WE8MSWIN1252 charactersets, but in AL32UTF8 the pound sign is code 49827 (C2 A3 in hex).
When using chr(163) in a AL32UTF8 database the 163 code is a illegal character, as 163 simply does not exist in UTF8, the pound sign is chr(49827) in an UTF8/AL32UTF8 system.
So be careful when using for example the CHR(<code>) function, the code for a character depends on the database characterset!
Instead of CHR() it’s far better to use Unistr(‘\<Unicode codepoint>’). Unistr() (a 9i new feature) works always on every characterset that knows the character. There is for example no need to change the Unistr value for the Euro symbol when changing from WE8MSWIN1252 to AL32UTF8.
For more info on how to check/find the code for a character in AL32UTF8 and using Unistr please see Note 69518.1 Storing and Checking Character Codepoints in an UTF8/AL32UTF8 (Unicode) database
Only US7ASCII (A-Z,a-z,0-9) characters have the same codepoints in AL32UTF8 as in US7ASCII, WE8ISO8859P1, AR8MSWIN1256 etc. meaning that using chr() for any value above 128 should be best avoided.
B.4) The meaning of SP2-0784, ORA-29275 and ORA-600 [kole_t2u], [34] errors / losing characters when using convert.
If you receive errors like SP2-0784: Invalid or incomplete character beginning 0xC4 returned or ORA-29275: partial multibyte character or ORA-600 [kole_t2u], [34] then this means that you are storing data in a character datatype that is NOT using AL32UTF8 encoding.
As can seen in Note 69518.1 Storing and Checking Character Codepoints in an UTF8/AL32UTF8 (Unicode) database an UTF8 code sequence cannot start with C4 for example.
The ORA-29275: partial multibyte character error is a result of Oracle is doing “sanity” checks on character strings on very low level to see if the code sequence is valid AL32UTF8, the checks are (for performance reasons) not yet 100%, so some illegal code sequences may not be detected, but they are enhanced in every version, the checks in 11g are far better/stricter then in 9i for example. This is done to avoid wrong result sets from functions and to reduce the risk of injection problems leading to security problems.
With Clob data you will not encounter ORA-29275 but ORA-600 [kole_t2u], [34]. See Note 734474.1 ORA-600 [kole_t2u], [34] – description, bugs, and reasons
SP2-0784 is a pure client side error/warning returned by sqlplus, it means the same as ORA-29275.
Any character datatype like CHAR, VARCHAR2 and CLOB expect the data to be in the encoding defined by the NLS_CHARACTERSET. Storing data in an encoding that is not the NLS_CHARACTERSET is not supported. Any data using a encoding different from the NLS_CHARACTERSET should be considered BINARY and a BINARY datatypes like RAW or BLOB should be used to store and process (!) this. Most of the time this is seen with “encrypted” (passwords etc) data stored in a VARCHAR2. See also point B.10) in this note.
This is also true when using the “convert” function by the way, any conversion from the NLS_CHARACTERSET to a other characterset should be considered as binary data as the result is not in the NLS_CHARACTERSET. When using the convert function you might even see errors like ORA-12703: this character set conversion is not supported. Note that this is expected behavior and that the convert function should not be used in normal application logic.
There is only one solution and that is to use CHARACTER datatypes for what they are designed for, store data in the NLS_CHARACTERSET. If you want to write out files in an other characterset then AL32UT8F then use UTL_FILE, see point B.8)
To find all *stored* data that might give these errors Csscan can be used.
install this first
Note 458122.1 Installing and configuring CSSCAN in 8i and 9i
Note 745809.1 Installing and configuring CSSCAN in 10g and 11g
The run Csscan with this syntax:
$ csscan \"sys/<syspassword>@<TNSalias> as sysdba\" FULL=Y FROMCHAR=<current NLS_CHARACTERSET> TOCHAR=<current NLS_CHARACTERSET> LOG=dbcheck CAPTURE=N ARRAY=1000000 PROCESS=2
* The <current NLS_CHARACTERSET> is seen in NLS_DATABASE_PARAMETERS.
select value from NLS_DATABASE_PARAMETERS where parameter=’NLS_CHARACTERSET’;
* the TOCHAR=<current NLS_CHARACTERSET> is not a typo, the idea is to check the CURRENT charterset for codes who are not defined in this NLS_CHARACTERSET
Any “Lossy” is data that will give ORA-29275 and/or ORA-600 [kole_t2u], [34]
To have an overview of the Csscan output and what it means please see Note 444701.1 Csscan output explained
Known Oracle Bugs who can give ORA-29275:
Bug 6268409 ORA-29275 ERROR WHEN QUERYING THE SQL_REDO/UNDO COLUNMS IN V$LOGMNR_CONTENTS
Fixed in 10.2.0.5 , 11.1.0.7 and up
Bug 5915741 ORA-29275 selecting from V$SESSION with Multibyte DB
Fixed in 10.2.0.5 , 11.1.0.6 and up
B.5) Going to AL32UTF8 from another characterset.
To change a database NLS_CHARACTERSET to AL32UTF8 we suggest to follow Note 260192.1 Changing the NLS_CHARACTERSET to AL32UTF8/UTF8 (Unicode)
Please note that it’s strongly recommend to follow that note when going to AL32UTF8 in any case, even when using (full) exp/imp.
Using Csalter / Alter Database Character Set when going to (AL32)UTF8 can be considerable faster then full exp/imp since you do not need to export all “US7ASCII” data (as all 0-9, A-Z and a-z character data will stay the same).
Please see the following note for an Oracle Applications database: Note 124721.1 Migrating an Applications Installation to a New Character Set.
This is the only way supported by Oracle applications. If you have any doubt log an Oracle Applications SR for assistance.
B.6) ORA-01401 / ORA-12899 while importing data in an AL32UTF8 database.
If import give errors like
IMP-00019: row rejected due to ORACLE error 1401
IMP-00003: ORACLE error 1401 encountered
ORA-01401: inserted value too large for columnor from 10g onwards:
ORA-02374: conversion error loading table "TEST"."NTEST"
ORA-12899: value too large for column COMMENT (actual: 6028, maximum: 4000)then this indicates that the columns cannot handle the “grow in bytes” of the data. This is solved by enlarging the column or using CHAR semantics. Please see point B.1) of this note.
One known issue is documented in Note 779526.1 CSSCAN does not detect data truncation for CHAR datatype – ORA-12899 when importing
B.6) Object and user names using non-US7ASCII characters.
Names can be from 1 to 30 bytes long with these exceptions:
* Names of databases are limited to 8 bytes.
* Names of database links can be as long as 128 bytes.
* Identifiers of database objects are max 30 bytes long.
http://download-uk.oracle.com/docs/cd/B19306_01/server.102/b14200/sql_elements008.htm#i27570
This select
SQL> select object_name from dba_objects where object_name <> convert(object_name,'US7ASCII');will return all objects having a non-US7ASCII name.
Using CHAR semantics is not supported in the SYS schema and that’s where the database object and user name is stored. If there are column names, schema objects or comments with non-US7ASCII names that take more then 30 bytes in AL32UTF8 there is no alternative besides renaming the affected objects or user to use a name that will occupy maximum 30 bytes.
* An username can be max 30 bytes long
http://download-uk.oracle.com/docs/cd/B19306_01/server.102/b14200/statements_8003.htm#SQLRF01503
“..his name can contain only characters from your database character set and must follow the rules described in the section “Schema Object Naming Rules”. Oracle recommends that the user name contain at least one single-byte character regardless of whether the database character set also contains multibyte characters.”
This select
SQL> select username from dba_users where username <> convert(username,'US7ASCII');will return all users having a non-US7ASCII name.
B.6) The password of an user can only contain single byte data in 10g and below.
http://download-uk.oracle.com/docs/cd/B19306_01/server.102/b14200/statements_8003.htm#SQLRF01503
Passwords can contain only single-byte characters from the database character set regardless of whether the character set also contains multibyte characters. This means that in an AL32UTF8 database the user password can only contain US7ASCII characters as this are the only single byte characters in AL32UTF8.
This may provoke a problem, if you migrate from (for example) a CL8MSWIN1251 database then your users can use Cyrillic in their passwords, in CL8MSWIN1251 Cyrillic letters are one single byte, in AL32UTF8 they’re not.
This restriction is lifted in 11g, there multibyte characters can be used as password string. Please note that they need to be updated in 11g before they use the new 11g hashing system. Please see Note 429465.1 11g R1 New Feature Case Sensitive Passwords and Strong User Authentication
B.7) When using DBMS_LOB.LOADFROMFILE
When using DBMS_LOB.LOADFROMFILE then please read Note 267356.1 Character set conversion when using DBMS_LOB.
B.8) When using UTL_FILE
When using UTL_FILE then please read Note 227531.1 Character set conversion when using UTL_FILE.
B.9) When using sqlldr or external tables
When using Sqlldr or external tables make sure to define the correct characterset of the file in the control file. The characterset of the database has in no direct relation with the encoding used in the file, in other words, it’s not because the database is using an AL32UTF8 characterset that using AL32UTF8 as NLS_LANG or as characterset in the control file is always correct. You need to specify the encoding of the file sqlldr is loading.
Please read Note 227330.1 Character Sets & Conversion – Frequently Asked Questions
18. What is the best way to load non-US7ASCII characters using SQL*Loader or External Tables?
B.10) Make sure you do not store “binary” (Encrypted) data in character datatypes (CHAR,VARCHAR2,CLOB).
If binary data is stored/handled as a CHAR, VARCHAR2 or CLOB datatype then data loss is expected, especially when using an AL32UTF8 database (even without using exp/imp). Or errors like ORA-29275 or ORA-600 [kole_t2u], [34] may appear.
This is not a bug, handling “binary” data using character datatypes is simply not supported. There is no “workaround” besides implementing a real solution.
This typically most visible on custom application “password hash” functions. Any encrypted/hash result data should be considered “binary” data as its endresult usually will contain codes (or code sequences) who are not valid in the current NLS_CHARACTERSET. Any character datatype like CHAR, VARCHAR2 and CLOB expect the data to be in the encoding defined by the NLS_CHARACTERSET. Storing data in an encoding that is not the NLS_CHARACTERSET is not supported. Any data using a encoding different from the NLS_CHARACTERSET should be considered BINARY and a BINARY datatypes like RAW or BLOB should be used to store and process (!) this.
The only solution is to use a or a binary datatype (RAW, LONG RAW, BLOB) for the encrypted data or convert the ‘raw’ encrypted data from a binary datatype to hex notation (rawtohex) or BASE64 (utl_encode.base64_encode) encoding before storing / using it in an character datatype, making it suitable to store it in a character datatype.
For more info on storing encrypted/hashed data please see:
Note 270919.1 Transferring Encrypted Data from one Database to Another
Note 580346.1 Store Encrypted Data into a VARCHAR2 Column Using DBMS_CRYPTO
Note 760247.1 How to Use Correctly DBMS_CRYPTO.HASH With Clob and Raw Input?
Here is a simple example of a serverside generated “binary” string (a MD5 hash function of a text input string) converted to a HEX string, which allows the endresult to be stored in and used with CHAR, VARCHAR2 or CLOB datatypes.
create or replace FUNCTION MD5_sample(l_string VARCHAR2) RETURN VARCHAR2
AS
l_MD5_stringHEX VARCHAR2(50 BYTE);
l_MD5_stringR RAW(50);
l_stringR RAW(4000);
BEGIN
-- using dbms_crypto is the best way.
l_stringR := utl_raw.cast_to_raw(l_string);
l_MD5_stringR := dbms_crypto.hash(typ => dbms_crypto.HASH_MD5, src => l_stringR);
-- dbms_crypto is not known in versions lower then 10.2
-- in that case you can also use dbms_obfuscation_toolkit, but do this only if this
-- is really needed
-- l_MD5_stringR := utl_raw.cast_to_raw(dbms_obfuscation_toolkit.md5( input_string => l_string ));
l_MD5_stringHEX := rawtohex(l_MD5_stringR);
RETURN l_MD5_stringHEX;
END;
/
Again, if for example this MD5 sum was made by the application code then the result needs to be converted to HEX/BASE64 notation before it can be submitted to the Oracle client as an character datatype.
B.11) String functions work with characters not byte (length,like,substr …).
Functions like Length and Substr count in characters, not bytes. So in an AL32UTF8 database the result of Lenght or substr will be different then the amount of BYTES this string uses.
Functions like Substr and Length, who are often used to prepare or limit a string this means application logic, NEED to be checked.
Note that substrB might generate a different result then expected in a UTF8 env:
-- the euro symbol ( U+20AC) is 3 bytes in UTF8
-- a ( U+0061) is one byte
SQL> select dump(to_char(UNISTR('\0061\20AC\0061')),1016) from dual;
DUMP(TO_CHAR(UNISTR('\0061\20AC\0061')),1016)
-------------------------------------------------
Typ=1 Len=5 CharacterSet=AL32UTF8: 61,e2,82,ac,61
SQL> select dump(substrB(to_char(UNISTR('\0061\20AC\0061')),1,4),1016) from dual;
DUMP(SUBSTRB(TO_CHAR(UNISTR('\0061\20AC\0061')
----------------------------------------------
Typ=1 Len=4 CharacterSet=AL32UTF8: 61,e2,82,ac
SQL> select dump(substrB(to_char(UNISTR('\0061\20AC\0061')),1,3),1016) from dual;
DUMP(SUBSTRB(TO_CHAR(UNISTR('\0061\20AC\006
-------------------------------------------
Typ=1 Len=3 CharacterSet=AL32UTF8: 61,20,20
SQL> select dump(substrB(to_char(UNISTR('\0061\20AC\0061')),1,2),1016) from dual;
DUMP(SUBSTRB(TO_CHAR(UNISTR('\0061\20AC\
----------------------------------------
Typ=1 Len=2 CharacterSet=AL32UTF8: 61,20
The point here is that SubstrB will replace the codes sequences that are illegal in AL32UTF8 with spaces (=code 20). For example “e2,82,ac” is the Euro in AL32UTF8 encoding, but if you strip the first byte you have “82,ac” which is an illegal code sequence in AL32UTF8, hence the “82,ac” is replaced with “20,20″, which is legal (= 2 times a space), otherwise SubstrB would generate strings who can provoke an ORA-29275.
Also NLS_INITCAP , NLS_LOWER and NLS_UPPER may return multibyte characters, hence the return string may be longer in bytes than the input string.
B.12) LPad and Rpad count in “display units” not characters.
http://download.oracle.com/docs/cd/B28359_01/server.111/b28286/functions087.htm#sthref1621
This means that
select lengthb(rpad(bytestst ,10,'x')) lengthb, lengthc(rpad(bytestst ,10,'x')) lengthc, rpad(bytestst ,10,'x') from (select UNISTR('\00E9') bytestst from dual);
will result in 10 characters but 11 bytes ! ( é is in AL32UTF8 2 bytes).
In complex scripts where one “character on the display” may be composed by several combined characters the difference (nr of chars vs bytes) may be even bigger.
With some scripts the number of characters returned may also vary, for example most Asian characters are 2 “display units” wide, so the Japanese 住 (which is a FULL WIDTH (or zenkaku) character) will “count for 2″ in L/RPAD.
select lengthc(rpad(japanesechar ,10,'x')) lengthc,
rpad(japanesechar ,10,'x') from (select UNISTR('\4F4F')
japanesechar from dual);Lengthc will return 9 characters.
The thing is that the Japanese character 住 takes 2 times the display width of a ascii character like x hence the RPAD will make the string 9 characters in total (2 time display for one 住 + 8 times display for 8 x = 10 display units).
There where some issues with this: Bug 5010130 , this is fixed in 11.1.0.6 and up.
To have a certain number of characters one can use something like:
RPAD ( str , n - LENGTHC(str),'c')Use LENGTHB if the requested width is in bytes, use LENGTHC if if the requested width is in characters.
SUBSTR( str || RPAD( 'c', n, 'c' ), 1, n )Use SUBSTRB if the requested width is in bytes,use SUBSTR if the requested width is in characters.
In above
* str is the string to be padded.
*’c’ is the fill character — this must be a single-byte char, usually “space” is used.
* n is the requested width in bytes or characters
B.13) Using LIKE and INSTR.
When using the LIKE operator it might be an idea to have a look at Note 232085.1 comparison of LIKE2, LIKE4 and LIKEC operators. The same is true for the INSTR/INSTR2/INSTR4/INSTRC functions.
B.14) Character functions that are returning character values might silently truncate data.
Be aware that character functions that are returning character values might silently truncate data , please see
http://download.oracle.com/docs/cd/B28359_01/server.111/b28286/functions001.htm#sthref928
These functions are: CHR, CONCAT, INITCAP, LOWER, LPAD, LTRIM, NLS_INITCAP,
NLS_LOWER, NLSSORT, NLS_UPPER, REGEXP_REPLACE, REGEXP_SUBSTR, REPLACE,
RPAD, RTRIM, SOUNDEX, SUBSTR, TRANSLATE, TREAT, TRIM and UPPER.
A example with replace to illustrate this can be found in Bug 4185519 . Please note that this is NOT a bug but intended behavior.
B.15) Column size triple when using Materialized Views / CTAS trough database link.
When using Materialized Views or doing CTAS of a table trough a database link it is strongly suggested to use CHAR semantics on BOTH sides to avoid problem with columns size and to avoid ORA-1401 or ORA-12899 errors.
From 8 bit (WE8xxxx) source to (AL32)UTF8 (CTAS target/MV side) database:
When not using CHAR semantics but BYTE semantics the column size will triple ( a CHAR 20 BYTE will show up as 60 BYTE in the AL32UTF8/UTF8 db) .
There is enhancement request Bug 9901628 - add parameter to control expansion factor of columns for al32utf8
If it’s really needed to use BYTE then the _keep_remote_column_size=true parameter can be set at the CTAS/MV side, this will very likely provoke ORA-1401 or ORA-12899 errors as the data will expand when there are non-US7ASCII characters.
To avoid running into ORA-1401 or ORA-12899 during the refresh when using CHAR semantics on both sides make sure the 8 bite side has no columns bigger then CHAR (666 CHAR) or VARCHAR2 (1333 CHAR).
* _keep_remote_column_size=true is NOT compatible with using CHAR semantics.
From AL32UTF8 to 8 bit (xx8xxxx) (CTAS target/MV side) database:
The BYTE length of the columns will not be adapted. The byte length of the data will remain the same or decrease no issues are to be expected except data that is NOT supported in the target characterset will be lost. This is possible since the 8 bit characterset only support a subset of AL32UTF8.
Example: when creating a MV in a WE8MSWIN1252 database and the base table in the AL32UTF8 database contains Arabic or Japanese then the Arabic or Japanese will be replaced by a inverse question mark. This is normal as WE8MSWIN1252 only defines West European languages not Arabic or Japanese.
B.16) When fetching data from non-AL32UTF8 databases using cursors (PL/SQL)
Please see Note 269381.1 ORA-01406 or ORA-06502 in PLSQL when querying data in (AL32)UTF8 db from remote non-UTF8 db using cursor
B.17) When using HTMLDB.
When you are using HTMLDB there will be a problem with the passwords after the change to (AL32)UTF8 log a tar ( 5 / RDBMS / NLS ) and refer to this note.
B.18) When using non-US7ASCII names in directory’s or file names.
This is basically a pretty bad idea on windows platforms. Also on Unix platforms there are possible problems. More information is in Note 738758.1 Using Non US7ASCII characters in filenames with Extproc, Bfile and other callouts.
Note that using non-US7ASCII characters for datafile names/directory’s in an AL32UTF8 database is NOT supported.
B.19) When using XDB (xmltype).
Please see Note 229291.1 XDB (xmltype) and NLS related issues for 9.2 and up
B.20) Upper and NLS_upper give unexpected results on the µ symbol.
See Note 1148599.1 Upper and NLS_Upper on the µ ( Micro ) symbol in an Unicode (AL32UTF8 or AL16UTF16 database)
C) The Client side.
C.1) Common misconceptions about NLS_LANG.
It’s common to think that the NLS_LANG should be UTF8 or AL32UTF8 when connecting to a AL32UTF8 database. This is not necessarily true, the NLS_LANG has in fact no relation with the database characterset. It’s purpose is to let oracle know what the client characterset is, so that Oracle can do the needed conversion.
Note 158577.1 NLS_LANG Explained (How does Client-Server Character Conversion Work?)
1.2 What is this NLS_LANG thing anyway?
Please make the difference between a client who can connect to a Unicode database (which is any 8.0 and up client for an UTF8 database and any 9i and up client for an AL32UTF8 database) and a “real” Unicode client.
A “real Unicode client” means a client who can display/insert all characters know by Unicode without need to recompile or change (Operating system) settings.
C.2) Configuring your UNIX client to be an UTF-8 (Unicode) client.
To have a Unix “Unicode client” you need to configure your Unix environment first to use UTF-8, then you have to check your telnet/ssh software to be able to use Unicode and then (as last step) you can use a NLS_LANG set to for example AMERICAN_AMERICA.AL32UTF8 and start sqlplus.
This is explained in Note 264157.1 The correct NLS_LANG setting in Unix Environments
C.3) Configuring your Windows client to be an UTF-8 (Unicode) client.
On Windows sqlplusw.exe or sqlplus.exe cannot be used as an Unicode / AL32UTF8 client to display or insert data interactively (= type things and select /update data by “typing” in sqlplus), in almost ALL cases using sqlplus(w).exe with a nls_lang set to AL32UTF8 is totally incorrect. This is explained in Note 179133.1 The correct NLS_LANG in a Windows Environment.
*run a script that is encoded in UTF-8.
* create a spool file in UTF-8
Please do note that this is about the encoding/characterset of the .sql or spool file, see point C6).
There are 2 Oracle provided AL32UTF8 (Unicode) sql clients for windows systems:
* iSqlplus
Note 231231.1 Quick setup of iSQL*Plus 9.2 as Unicode (UTF8) client on windows.
Note 281847.1 How do I configure or test iSQL*Plus 10i?
* Oracle SQL Developer
This is also an Unicode client which can be downloaded for free from: http://www.oracle.com/technology/products/database/sql_developer/index.html
When using / writing an Unicode application on windows then this application should be specifically written to use the Unicode API of windows, setting the NLS_LANG to AMERICAN_AMERICA.AL32UTF8 is not enough to “make” an application Unicode.
Please consult your application vendor or vendor of the development environment to check if and how to use this as an AL32UTF8 client or Unicode programming environment.
Older versions of the popular Toad tool are NOT able to run against AL32UTF8 databases. Please do not use the “workaround” of setting NLS_LANG to UTF8 for toad, data will get corrupted doing this.
Information about this is found on Quest’s website here and here .
For current Unicode support questions about Toad, please contact Quest software.
All Oracle 9i and up Oracle clients are “compatible” with an AL32UTF8 database, even when using a non-AL32UTF8 NLS_LANG (see point C.1) ).
It is perfectly correct to use sqlplusw.exe on (for example) a West European / US Windows client ( using a NLS_LANG set to AMERICAN_AMERICA.WE8MSWIN1252, which is the correct value for a West European / US Windows system ) to connect to a AL32UTF8 database.
However the Western sqlplus client will then only able to insert / view West European characters.
If another user, using sqlplus on a correctly configured Chinese Windows system, inserts data then this will not be visible in the West European sqlplus client. When updating the Chinese data using the West European client this will delete the Chinese data, it will become an inverted question mark.
C.4) The default column width of output in sqlplus will change.
An other often unnoticed side effect (until the migration is done) of going to AL32UTF8 is that the default column width of output will change in sqlplus.
The output in sqlplus when connected to a WE8MSWIN1252 or a other 8 bit characterset database
SQL> select rpad(dummy,10,'x') from dual;
RPAD(DUMMY,10,'X')
----------
Xxxxxxxxxx
becomes when connected to an AL32UTF8 database
SQL> select rpad(dummy,10,'x') from dual;
RPAD(DUMMY,10,'X')
----------------------------------------
Xxxxxxxxxx
For more information please see Note 330717.1 Output widths change after upgrade, or change of character set.
C.5) Configuring your web based client to be a Unicode client.
When using web based applications it’s advisable to read
Note 229786.1 NLS_LANG and webservers explained.
Note 115001.1 NLS_LANG Client Settings and JDBC Drivers
C.6) Using Sqlplus to run scripts in UTF-8 encoding
When using sqlplus to run .sql file that contain non-US7ASCII characters the NLS_LANG needs to be the encoding of the file.
So, if you load for example a txt/sql flat file made on a windows box in notepad then by default this is WE8MSWIN1252, hence if you run this in sqlplus the NLS_LANG needs to be WE8MSWIN1252 , it doesn’t matter if this is ran in sqlplus on unix or <insert any other os >.
In general most customers
or have a policy that you should NOT use any non-US7ASCII characters in sql code / scripts – in that case the used nls_lang is not an issue.
or have a policy to use UTF8 encoding without BOM (more on that later) for every flat txt/sql file
If you want to load UTF-8 encoded .sql files in sqlplus then the NLS_LANG needs to be set to <Language>_<Territory>.AL32UTF8 (for example AMERICAN_AMERICA.AL32UTF8 ) regardless of the platform (so also on windows) as it needs to match the encoding of the file. The problem is that if you use this setup on windows you will NOT be able to “see” the UTF8 data properly, aldo (assuming the .sql file contains proper UTF-8 data) it will be inserted correctly – you can check the result with Sqldeveloper for example. Only on Unix platforms you can configure a system so that you can use a NLS_LANG set to UTF8/AL32UTF8 and also use this for both loading .sql files in UTF-8 and inserting/selecting UTF8 data “interactively”.
When using UTF8/AL32UTF8 as NLS_LANG there is one other issue: the BOM ( http://unicode.org/faq/utf_bom.html#BOM ) .This is an (not mandatory but allowed) 3 bytes sequence at the start of a flat file to indicate a file is in UTF-8 encoding (products like notepad insert this).
This gives a problem in sqlplus, assume 2 files saved in notepad as UTF8 (file save as – choose UTF8):
test1.sql contains on line:
select sysdate from dual;test2.sql contains two lines:
-- this file needs to be saved as UTF-8
select sysdate from dual;this gives in sqlplus.exe on windows when using a NLS_LANG set to UTF8/AL32UTF8:
SQL> @d:\test1.sql
SP2-0734: unknown command beginning "´╗┐select sy..." - rest of line ignored.
SQL> @d:\test2.sql
SP2-0734: unknown command beginning "´╗┐-- this f..." - rest of line ignored.
SYSDATE
---------------
23-JUL-09
SQL>
The “workaround” of putting a comment line as first line is a bit “ugly”, it gives SP2-0734, but it works.
An other solution is to use a script or so that removes the BOM (if present) from flat/txt files http://www.w3.org/International/questions/qa-utf8-bom
In much the same way one can us the sqlplus on a windows system to create UTF-8 spool files.
Note however that the output in sqlplus will NOT be correct, but the resulting data in the spool file will be.
To check a UTF-8 file on Windows notepad can be used, file -open – choose “UTF-8″ as “encoding”.
If you see squares instead of the characters then try an other font like “Arial Unicode MS”.
C.7) Spooling files using sqlplus is much slower using NLS_LANG set to UTF8 or AL32UTF8
This is Bug 6350579 - spooling with trimspool and linesize takes too long in 10.2.0.3
Fixed-Releases: 11.2.0.1, 10.2.0.5, Windows 10.2.0.4.0 Patch 14 and up, does not happen in 9i
Details: when using a NLS_LANG set to UTF8 or AL32UTF8 spooling a file will take a huge amount of time compared to 9i, the time is especially longer when using ” trimspool on” in sqlplus
Workaround: use non-UTF8 nls_lang or use trimspool off
C.8) Using Oracle Applications.
Please see:
Note 393861.1 Globalization Guide for Oracle Applications Release 12
Note 222663.1 Internationalization Update Notes for the Oracle E-Business Suite 11i
C.9) Using Portal.
You need to change after the database change the Portal dad to reflect the new character set. You can do this by:
1. Login to EM Application Server Control
2. Click on the midtier farm
3. Click on HTTP_Server
4. Click on Administration
5. Click on PL/SQL Properties
6. Scroll down to the DAD and click on it.
Within there you can modify the NLS Language to match your db. Press Apply after making the change.
C.10) Oracle Forms PDF and Unicode
Note 97441.1 Does Oracle Reports Support Unicode ( UTF8 ) Characters in PDF Output?
C.11) Changing a database to AL32UTF8 hosting an OracleAS 10g Metadata Repository.
Follow Note 260192.1 Changing the NLS_CHARACTERSET to AL32UTF8/UTF8 (Unicode) and then follow the steps in the documentation set Oracle Application Server Administrator’s Guide 10g Release 2 (10.1.2) , 6.5 Changing the Character Set of OracleAS Metadata Repository
D) Known Issues
Bug 6350579 - spooling with trimspool and linesize takes too long in 10.2.0.3
Fixed-Releases: 11.2.0.1, 10.2.0.5, Windows 10.2.0.4.0 Patch 14 and up,
Details: when using a NLS_LANG set to UTF8 or AL32UTF8 spooling a file will take a huge amount of time compared to 9i, the time is especially longer when using ” trimspool on” in sqlplus
Workaround: use non-UTF8 nls_lang or do not use trimspool
Bug 9727970 NLS_INITCAP RETURNS WRONG RESULTS WITH NLS_SORT=’XTURKISH’
Fixed-Releases: not fxed yet
Details: NLS_INITCAP gives wrong for Turkish result in a non-Turkish characterset like AL32UTF8, NLS_UPPER and NLS_LOWER work
Workaround: use NLS_INITCAP( NLS_LOWER( … ) … ) instead.
bug 5010130 LPAD/RPAD BEHAVIOR NOT CONSISTENT IN PRE 10GR2 RELEASES WITH AMBIGUOUS CHARS
Fixed-Releases: 11.1.0.6
Details: Unicode Ambiguous characters can have an display width of 1 or 2 depending on the context , resulting in different behavior when used in L/Rpad, AL16UTF16 implements the Unicode Ambiguous character behavior fully, in 10.2 UTF8 defines all Unicode Ambiguous characters as display width 1, AL32UTF8 defines Unicode Ambiguous characters as display width 2.
Fixed in 11.1.0.6 (both UTF8 and AL32UTF8 define now Unicode Ambiguous characters as display width 1, use AL16UTF16 as NLS_NCHAR_CHARACTERSET and N-types to have the full Unicode Ambiguous character behavior) , backport to 10g possible.
The Unicode EastAsianWidth.txt shows what characters are “Ambiguous” , they have “A” as East Asian Width property. Note that this includes non-Asian characters like some Greek or Cyrillic characters.
Bug 5581731 Errors loading wrapped PLSQL in multibyte from client other than SQLPLUS
Fixed-Releases: 11.1.0.6, 10.2.0.4
Details: Attempting to load / compile a wrapped PLSQL input via a mechanism other than SQL*Plus (eg: OCI) if the database’s character set is multi-byte can fail with PLS-103, ORA-24344 or other syntax-related compilation errors.
Workaround: Remove trailing blank lines and / from the input
© 2009 – 2011, www.oracledatabase12g.com. 版权所有.文章允许转载,但必须以链接方式注明源地址,否则追究法律责任.
相关文章 | Related posts:




最新评论