Discussed in #2144
Originally posted by wkbrd October 16, 2023
We are attempting to use wide strings (CORBA::WChar*) on Linux, but the encoding does not behave as expected for characters whose representation does not fit into a single UTF-16 code unit.
On Linux, wchar_t is a 32-bit type. When a wide string comes over the wire, including from other ORBs, the default behavior is that each element of the string holds the next 16-bit code unit of the string's UTF-16 representation, rather than the next character in the native UTF-32 layout. Is this intended?
For example, 🂡🂮🂭🂫🂪 is being marshaled as L"\xd83c\xdca1\xd83c\xdcae\xd83c\xdcad\xd83c\xdcab\xd83c\xdcaa"
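To make the difference concrete, here is a minimal sketch in standard C++ (not TAO code) of the two interpretations. The helper names `encode_utf16`, `widen_units`, and `decode_utf16` are hypothetical and chosen only for illustration. `widen_units` mimics the observed result, where every UTF-16 code unit lands in its own 32-bit element, while `decode_utf16` produces the UTF-32 layout we expected.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>

// Encode one code point as UTF-16. Code points above U+FFFF
// become a high/low surrogate pair.
inline std::u16string encode_utf16(char32_t cp) {
    assert(cp <= 0x10FFFF);
    std::u16string out;
    if (cp < 0x10000) {
        out.push_back(static_cast<char16_t>(cp));
    } else {
        const std::uint32_t v = static_cast<std::uint32_t>(cp) - 0x10000;
        out.push_back(static_cast<char16_t>(0xD800 + (v >> 10)));
        out.push_back(static_cast<char16_t>(0xDC00 + (v & 0x3FF)));
    }
    return out;
}

// Observed layout: each 16-bit code unit is copied into its own
// 32-bit element, so surrogate pairs stay split.
inline std::u32string widen_units(const std::u16string& in) {
    std::u32string out;
    for (char16_t u : in) out.push_back(static_cast<char32_t>(u));
    return out;
}

// Expected layout: surrogate pairs are combined into one code point.
// Unpaired surrogates are passed through unchanged in this sketch.
inline std::u32string decode_utf16(const std::u16string& in) {
    std::u32string out;
    for (std::size_t i = 0; i < in.size(); ++i) {
        const std::uint32_t hi = in[i];
        if (hi >= 0xD800 && hi <= 0xDBFF && i + 1 < in.size()) {
            const std::uint32_t lo = in[i + 1];
            if (lo >= 0xDC00 && lo <= 0xDFFF) {
                out.push_back(static_cast<char32_t>(
                    0x10000 + ((hi - 0xD800) << 10) + (lo - 0xDC00)));
                ++i;
                continue;
            }
        }
        out.push_back(static_cast<char32_t>(hi));
    }
    return out;
}
```

For U+1F0A1 (🂡), `encode_utf16` yields the two units 0xD83C 0xDCA1. `widen_units` keeps them as two elements, matching the string above, whereas `decode_utf16` returns the single element 0x1F0A1.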
A search of the source tree found references to -ORBNativeWcharCodeSet UCS-4 -ORBWcharCodesetTranslator WUCS4_UTF16_Factory in TAO/tests/CodeSets/simple/. Using -ORBNativeWcharCodeSet UCS-4 produces a string represented as an array of wchar_t twice the length of the UTF-32 representation, where each pair of elements holds the low 16 bits of the UTF-32 character followed by the high 16 bits.
For example, 🂡🂮🂭🂫🂪 is being marshaled as L"¡\001®\001¬\001«\001ª\001"
(gdb) print /x wstrStreamName[0]
$2 = 0xf0a1
(gdb) print /x wstrStreamName[1]
$3 = 0x1
(gdb) print /x wstrStreamName[2]
$4 = 0xf0ae
(gdb) print /x wstrStreamName[3]
$5 = 0x1
(gdb) print /x wstrStreamName[4]
$6 = 0xf0ad
(gdb) print /x wstrStreamName[5]
$7 = 0x1
(gdb) print /x wstrStreamName[6]
$8 = 0xf0ab
(gdb) print /x wstrStreamName[7]
$9 = 0x1
(gdb) print /x wstrStreamName[8]
$10 = 0xf0aa
(gdb) print /x wstrStreamName[9]
$11 = 0x1
(gdb) print /x wstrStreamName[10]
$12 = 0x0
(gdb) print /x wstrStreamName[11]
$13 = 0x0
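The layout in the gdb output can be reproduced with a short standalone sketch. This is only an illustration of the observed pattern, not a statement about how TAO produces it. The function names `split_low_high` and `join_low_high` are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Reproduce the observed layout: each UTF-32 code point becomes two
// 32-bit elements, the low 16 bits first and the high 16 bits second.
inline std::vector<std::uint32_t> split_low_high(
        const std::vector<std::uint32_t>& code_points) {
    std::vector<std::uint32_t> out;
    out.reserve(code_points.size() * 2);
    for (std::uint32_t cp : code_points) {
        out.push_back(cp & 0xFFFF);  // e.g. 0x1F0A1 -> 0xF0A1
        out.push_back(cp >> 16);     // e.g. 0x1F0A1 -> 0x1
    }
    return out;
}

// Inverse: recombine each low/high pair into the original code point.
inline std::vector<std::uint32_t> join_low_high(
        const std::vector<std::uint32_t>& elems) {
    assert(elems.size() % 2 == 0);
    std::vector<std::uint32_t> out;
    out.reserve(elems.size() / 2);
    for (std::size_t i = 0; i + 1 < elems.size(); i += 2) {
        out.push_back((elems[i + 1] << 16) | (elems[i] & 0xFFFF));
    }
    return out;
}
```

Applied to 0x1F0A1 and 0x1F0AE, `split_low_high` gives 0xF0A1, 0x1, 0xF0AE, 0x1, which matches elements 0 through 3 in the gdb session. This pattern is consistent with each 4-byte character being read as two little-endian 2-byte units, though that is an inference from the output rather than something confirmed in the source.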
Character reference: https://en.wikipedia.org/wiki/Playing_cards_in_Unicode
Based on the discussion, we attempted to use WUCS4_UTF16.cpp but encountered issues. The associated PR addresses those issues.