ACE/TAO Wide Strings on Linux #2145

@wkbrd

Discussed in #2144

Originally posted by wkbrd October 16, 2023
We are attempting to use wide strings (CORBA::WChar*) on Linux; however, the encoding does not behave as we expected for characters whose representation does not fit into a single UTF-16 code unit.

On Linux, wchar_t is a 32-bit type. When a wide string comes over the wire, including from other ORBs, the default behavior is that each element of the string holds the next 16-bit code unit of the UTF-16 representation, rather than the next character in the native UTF-32 layout. Is this intended?

For example, 🂡🂮🂭🂫🂪 is being marshaled as L"\xd83c\xdca1\xd83c\xdcae\xd83c\xdcad\xd83c\xdcab\xd83c\xdcaa"
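To make the arithmetic behind that output concrete, here is a minimal sketch of the standard UTF-16 surrogate computation, plus a helper that collapses a buffer of one-code-unit-per-element back into UTF-32. The names `to_surrogates` and `collapse_utf16_units` are hypothetical and are not part of ACE/TAO; `char32_t` stands in for the 32-bit `wchar_t` on Linux.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>
#include <utility>

// Hypothetical helper (not an ACE/TAO API): split a supplementary-plane
// code point (U+10000..U+10FFFF) into its UTF-16 surrogate pair {high, low}.
constexpr std::pair<std::uint16_t, std::uint16_t> to_surrogates(char32_t cp)
{
  const std::uint32_t v = static_cast<std::uint32_t>(cp) - 0x10000u;
  return {static_cast<std::uint16_t>(0xD800u + (v >> 10)),
          static_cast<std::uint16_t>(0xDC00u + (v & 0x3FFu))};
}

// Hypothetical helper: each 32-bit element of `units` holds one UTF-16
// code unit, as observed in the issue. Valid surrogate pairs are combined
// into one code point; anything else is copied through unchanged.
std::u32string collapse_utf16_units(const std::u32string& units)
{
  std::u32string out;
  for (std::size_t i = 0; i < units.size(); ++i) {
    const std::uint32_t hi = units[i];
    if (hi >= 0xD800u && hi <= 0xDBFFu && i + 1 < units.size()) {
      const std::uint32_t lo = units[i + 1];
      if (lo >= 0xDC00u && lo <= 0xDFFFu) {
        out.push_back(static_cast<char32_t>(
            0x10000u + ((hi - 0xD800u) << 10) + (lo - 0xDC00u)));
        ++i;  // consume the low surrogate as well
        continue;
      }
    }
    out.push_back(static_cast<char32_t>(hi));
  }
  return out;
}
```

For U+1F0A1 (🂡), `to_surrogates` yields 0xD83C and 0xDCA1, matching the first two elements reported above.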

A search of the source tree found references to -ORBNativeWcharCodeSet UCS-4 -ORBWcharCodesetTranslator WUCS4_UTF16_Factory in TAO/tests/CodeSets/simple/. With -ORBNativeWcharCodeSet UCS-4, the string arrives as an array of wchar_t twice the length of the UTF-32 representation, where each pair of elements holds the low 16 bits of the UTF-32 character followed by its high 16 bits.

For example, 🂡🂮🂭🂫🂪 is being marshaled as L"¡\001®\001¬\001«\001ª\001"

(gdb) print /x wstrStreamName[0]
$2 = 0xf0a1
(gdb) print /x wstrStreamName[1]
$3 = 0x1
(gdb) print /x wstrStreamName[2]
$4 = 0xf0ae
(gdb) print /x wstrStreamName[3]
$5 = 0x1
(gdb) print /x wstrStreamName[4]
$6 = 0xf0ad
(gdb) print /x wstrStreamName[5]
$7 = 0x1
(gdb) print /x wstrStreamName[6]
$8 = 0xf0ab
(gdb) print /x wstrStreamName[7]
$9 = 0x1
(gdb) print /x wstrStreamName[8]
$10 = 0xf0aa
(gdb) print /x wstrStreamName[9]
$11 = 0x1
(gdb) print /x wstrStreamName[10]
$12 = 0x0
(gdb) print /x wstrStreamName[11]
$13 = 0x0
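The gdb values above are consistent with each UTF-32 code point being split into two 16-bit halves, low half first. The sketch below only illustrates that arithmetic; it is not the TAO translator code, and `split_halves` and `join_halves` are hypothetical names.

```cpp
#include <cstdint>
#include <utility>

// Hypothetical helper: split a 32-bit code point into {low 16 bits, high 16 bits},
// the element order seen in the gdb session.
constexpr std::pair<std::uint32_t, std::uint32_t> split_halves(char32_t cp)
{
  return {static_cast<std::uint32_t>(cp) & 0xFFFFu,
          static_cast<std::uint32_t>(cp) >> 16};
}

// Hypothetical helper: rebuild the code point from the two halves.
constexpr char32_t join_halves(std::uint32_t low, std::uint32_t high)
{
  return static_cast<char32_t>((high << 16) | (low & 0xFFFFu));
}

// U+1F0A1 splits into 0xF0A1 and 0x1, as in wstrStreamName[0] and [1].
static_assert(split_halves(U'\U0001F0A1').first == 0xF0A1u, "low half");
static_assert(split_halves(U'\U0001F0A1').second == 0x1u, "high half");
static_assert(join_halves(0xF0A1u, 0x1u) == U'\U0001F0A1', "round trip");
```

Joining each pair of elements this way would recover the five intended code points from the first ten elements shown.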

Character reference: https://en.wikipedia.org/wiki/Playing_cards_in_Unicode

Based on the discussion, we attempted to use WUCS4_UTF16.cpp but encountered issues. The associated PR addresses them.
