`<regex>`: Make `wregex` handle small character ranges containing U+00FF and U+0100 correctly #5437

muellerj2 · 2025-04-25T17:40:30Z

This fixes a weird bug for small character ranges near U+0100 that I stumbled upon while working on the proof-of-concept for eliminating _Uelem (#995). It's also a follow-up to #5238 and adds the characters in a collating range to the bitmap of the character class.

I will discuss the follow-up for collating ranges first. Strictly speaking, as of now we don't have to add characters in a range to the bitmap as long as we store the character range, because the matcher always checks character ranges before looking up the character in the bitmap. (This will become relevant for the bug as well.) So adding these characters to the bitmap for collating ranges seems like unnecessary work, because the difference can't be observed by users. However, I think it makes sense to invert this check order in the matcher in the longer run for performance reasons (regex_traits::transform() is relatively expensive, so should be avoided during matching). That's why I think it's a good idea to correctly update the bitmap for collating ranges now before the first official STL release containing the collating ranges fix becomes available, because this might allow us to assume in some future change that the bitmap is always filled correctly when the collate syntax flag is set.

Now let's discuss the bug. _Builder::_Add_range2() performs two optimizations when the collate flag is not set: The first optimization tries to put characters with code points <= 0xFF in a bitmap, while the second one replaces small ranges by the individual characters in this range.

In the first optimization, the loop condition checks _Ex0 <= _Ex1 && _Ex1 < _Get_bmax(), so it doesn't set the bits for characters between _Ex0 and U+00FF in the bitmap when _Ex1 >= 0x100.

This usually remains unobservable, though, because the matcher tests the stored character ranges first before looking up the bitmap. (I wonder whether this order was chosen as a workaround because these bits don't get set in the bitmap here. Or conversely the loop condition is like this because ranges are checked first in the matcher: If the optimization can't optimize the range away, it is actually a slight pessimization for the characters in the bitmap range performance-wise if they are represented in the bitmap and not by the range.)

But for a range with _Ex0 < 0x100 <=_Ex1, this leads to a problem when the second optimization in _Builder::_Add_range2 gets applied: This optimization assumes that it is applied to character ranges that don't overlap with the range of characters represented by the bitmap (i.e., it assumes _Ex0 >= 0x100). When it kicks in, the character range is no longer represented as a range in the NFA node, but instead all characters in the range are added to a character array that is intended for all matched characters with code points >= 0x100.

This leads to the following problem:

The first optimization doesn't kick in for _Ex1 >= 0x100, so the bits for characters with code points between _Ex0 and 0xFF aren't set in the bitmap.
The second optimization does kick in if _Ex1 - _Ex0 is small enough, so the character range is no longer represented as a character range. Instead, the characters are added to the matched character array for code points >= 0x100.
As a result, we have the following behavior when the matcher tries to match \u00FF to the character class [\u00FF-\u0100]:
- It won't find a match when checking character ranges because the character range is not represented as a range due to the second optimization.
- It won't find a match when looking up \u00FF in the bitmap because the bit for \u00ff is not set due to first optimization's loop condition.
- It won't examine the array for characters with code points >= 0x100 because the character should not be represented there.
- So it concludes that the character class does not match \u00FF.

To fix this, I opted to change the loop condition of the first optimization to _Ex0 <= _Ex1 && _Ex0 < _Bmp_max (where _Bmp_max == _Get_bmax()). As mentioned, this is actually a pessimization for characters with code points between 0x00 and 0xff if _Ex >= 0x100 and the second optimization isn't applied, but (a) these ranges starting at some code point <= 0xFF and ending at some code point >= 0x100 are probably rare in practice and (b) this is a first step towards evaluating the bitmap before ranges in the matcher.

Additional changes:

Because the code flow for collating ranges and normal ranges is now mostly separated, _Get_bmax() and _Get_tmax() have lost their usefulness. I replaced the functions by the returned constants and marked the members _Bmax and _Tmax with a // TRANSITION, ABI comment. (We can't remove the initialization of _Bmax and _Tmax in the constructor yet due to the possibility of mixing.)
I moved the check that ranges are non-empty into _Builder::_Add_range3() to avoid calling _Traits.transform() more than necessary. _Add_range2() was renamed to ensure that the check can't get lost when old and new TUs are mixed.

…0FF and U+0100 correctly

tests/std/tests/GH_005204_regex_collating_ranges/test.cpp

StephanTLavavej · 2025-05-03T07:09:45Z

Thanks as always for the thorough investigation, careful fix, and extremely detailed writeup to help mortals understand the problem and solution! 😻

I pushed a conflict-free merge and a minor expansion of test coverage with negative cases.

StephanTLavavej · 2025-05-09T15:48:56Z

I'm mirroring this to the MSVC-internal repo - please notify me if any further changes are pushed.

StephanTLavavej · 2025-05-10T08:47:54Z

Thanks for noticing and fixing this bug! 🦅 👁️ 🐞

<regex>: Make wregex handle small character ranges containing U+0…

e02bfa7

…0FF and U+0100 correctly

muellerj2 requested a review from a team as a code owner April 25, 2025 17:40

github-project-automation bot added this to STL Code Reviews Apr 25, 2025

github-project-automation bot moved this to Initial Review in STL Code Reviews Apr 25, 2025

StephanTLavavej added bug Something isn't working regex meow is a substring of homeowner labels Apr 25, 2025

StephanTLavavej self-assigned this Apr 25, 2025

This comment was marked as resolved.

Sign in to view

Merge branch 'main' into regex-handle-ranges-near-u+0100-correctly

96072d5

StephanTLavavej reviewed May 3, 2025

View reviewed changes

tests/std/tests/GH_005204_regex_collating_ranges/test.cpp Show resolved Hide resolved

Expand test coverage with negative cases.

f2b4d51

StephanTLavavej approved these changes May 3, 2025

View reviewed changes

StephanTLavavej removed their assignment May 3, 2025

StephanTLavavej moved this from Initial Review to Ready To Merge in STL Code Reviews May 3, 2025

StephanTLavavej mentioned this pull request May 3, 2025

Maintainer priorities #4700

Open

StephanTLavavej moved this from Ready To Merge to Merging in STL Code Reviews May 9, 2025

StephanTLavavej added a commit to StephanTLavavej/STL that referenced this pull request May 9, 2025

microsoftGH-5437

3551f8b

StephanTLavavej merged commit 94b9650 into microsoft:main May 10, 2025
39 checks passed

github-project-automation bot moved this from Merging to Done in STL Code Reviews May 10, 2025

StephanTLavavej mentioned this pull request May 10, 2025

<regex>: regex_traits::transform_primary should yield primary sort keys appropriate for the imbued locale #5444

Merged

muellerj2 deleted the regex-handle-ranges-near-u+0100-correctly branch May 31, 2025 21:34

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

`<regex>`: Make `wregex` handle small character ranges containing U+00FF and U+0100 correctly #5437

`<regex>`: Make `wregex` handle small character ranges containing U+00FF and U+0100 correctly #5437

muellerj2 commented Apr 25, 2025 •

edited

Loading

Uh oh!

This comment was marked as resolved.

Uh oh!

StephanTLavavej commented May 3, 2025

Uh oh!

StephanTLavavej commented May 9, 2025

Uh oh!

Uh oh!

StephanTLavavej commented May 10, 2025

Uh oh!

Uh oh!

<regex>: Make wregex handle small character ranges containing U+00FF and U+0100 correctly #5437

<regex>: Make wregex handle small character ranges containing U+00FF and U+0100 correctly #5437

Conversation

muellerj2 commented Apr 25, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

This comment was marked as resolved.

Uh oh!

StephanTLavavej commented May 3, 2025

Uh oh!

StephanTLavavej commented May 9, 2025

Uh oh!

Uh oh!

StephanTLavavej commented May 10, 2025

Uh oh!

Uh oh!

`<regex>`: Make `wregex` handle small character ranges containing U+00FF and U+0100 correctly #5437

`<regex>`: Make `wregex` handle small character ranges containing U+00FF and U+0100 correctly #5437

muellerj2 commented Apr 25, 2025 •

edited

Loading