`<regex>`: Properly parse and match collating symbols and equivalences #5392

muellerj2 · 2025-04-05T17:33:12Z

Fixes #994 and solves the single-character case of #5391. Also makes slight progress towards #438.

Collating symbols and equivalences are now parsed following the intended semantics in the standard as much as is possible under the current layout of NFA nodes:

- Pass the name of the collating symbol to lookup_collatename() in the traits class to obtain the collating element.
- Reject the expression with error_collate if lookup_collatename() returns an empty string (see [re.grammar]/8 and 10). Also reject it with error_space if the returned collating element can't be stored in the NFA node.
- If the expression starts with `[.`, we should process the collating element like a normal character. This is easy if the collating element is just a single character. However, ranges bounded by multi-character collating elements can't currently be represented in the NFA. For this reason, multi-character collating elements are treated syntactically as a set and immediately added to the NFA by calling _Nfa._Add_coll2(). This means that these collating elements can be individual elements of a character class, but any attempt to use them as range bounds will be rejected by the parser with an error_range.
- If the expression starts with `[=`, it denotes an equivalence class. In this case, add the collating element to the NFA by calling _Nfa._Add_equiv2(). This function calls transform_primary() and checks whether we can obtain a primary sort key for this collating element. If not, it rejects the equivalence with an error_collate (see [re.grammar]/10).
  - I doubt that the current implementation of transform_primary() in regex_traits can even return an empty string. But it might do so in the future when #5291 (LWG-4186 "regex_traits::transform_primary mistakenly detects typeid of a function") is addressed.
  - It would be preferable to store the primary sort key rather than the collating element in the NFA, to avoid recomputing the sort key repeatedly during matching. Unfortunately, this choice wasn't made in the past, and although the parser didn't do the name lookup through lookup_collatename(), this feature otherwise worked. So it seems to have been usable enough that we shouldn't deliberately break people just for some performance gain.
  - In contrast, I did make such a change for collating elements: these are now passed through translate() or translate_nocase() before being stored, and the matcher assumes that this translation pass has happened. The major difference is that this feature was utterly broken before this PR (as documented in #994, "<regex>: Regex erroneously returns a match"). So I believe this assumption is fine: even if the new matcher after this PR is picked up with the old parser before this PR, at worst there will be some patterns for which some inputs are still mistakenly not matched, which is still much better than the utterly broken behavior that the old matcher produces.
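For readers who want the decision flow above in one place, here is a minimal sketch. It is not the code in stl/inc/regex: `ToyTraits`, `BracketKind`, `Outcome`, and `classify` are hypothetical names, the toy traits take strings instead of iterator pairs, the assumption that "ch" is a known collating element is made up for illustration, and the error_space and error_range checks are left out.

```cpp
#include <string>

// Hypothetical names for illustration only; none of these exist in the STL.
enum class BracketKind { CollatingSymbol, Equivalence }; // "[.name.]" vs. "[=name=]"
enum class Outcome { ErrorCollate, OrdinaryChar, MultiCharElement, EquivalenceClass };

// Toy stand-in for a traits class. Assumption: every single character and
// the name "ch" are recognized as collating elements.
struct ToyTraits {
    std::string lookup_collatename(const std::string& name) const {
        if (name.size() == 1 || name == "ch") {
            return name;
        }
        return {}; // unknown name
    }

    // Toy primary sort key: the element itself (never empty here).
    std::string transform_primary(const std::string& elem) const {
        return elem;
    }
};

template <class Traits>
Outcome classify(const Traits& traits, BracketKind kind, const std::string& name) {
    const std::string elem = traits.lookup_collatename(name);
    if (elem.empty()) {
        return Outcome::ErrorCollate; // name is not a collating element
    }

    if (kind == BracketKind::Equivalence) {
        // An equivalence class needs a primary sort key.
        return traits.transform_primary(elem).empty() ? Outcome::ErrorCollate : Outcome::EquivalenceClass;
    }

    // Single characters behave like ordinary characters; longer elements
    // can only be stored as individual members of the class.
    return elem.size() == 1 ? Outcome::OrdinaryChar : Outcome::MultiCharElement;
}
```

With these toy traits, `classify(ToyTraits{}, BracketKind::CollatingSymbol, "ch")` yields `Outcome::MultiCharElement`, while an unknown name such as "zz" yields `Outcome::ErrorCollate`.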

In the matcher, the implementation of _Lookup_coll was completely revised, as the old one was utterly broken. The new implementation assumes that the collating elements are sorted in descending order of length. The parser has always sorted them this way (as it took advantage of the sorting), but nothing in the matcher made use of it until now.

I can also change the code to remove this assumption if this is your preference.
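To show why the descending-length order is convenient, here is a small sketch of a longest-match lookup. It only illustrates the idea and is not the actual _Lookup_coll: `longest_match_length` and `longer_first` are hypothetical helpers, and the elements are plain strings rather than NFA node data.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Ordering assumed by the lookup: longer elements come first.
inline bool longer_first(const std::string& a, const std::string& b) {
    return a.size() > b.size();
}

// Hypothetical helper: returns the length of the longest collating element
// that matches at the start of `input`, or 0 if none matches.
// Precondition: `elements` is sorted by descending length.
inline std::size_t longest_match_length(const std::vector<std::string>& elements, const std::string& input) {
    assert(std::is_sorted(elements.begin(), elements.end(), longer_first));

    for (const std::string& elem : elements) {
        // Because of the sort order, the first hit is also the longest one.
        if (input.compare(0, elem.size(), elem) == 0) {
            return elem.size();
        }
    }
    return 0;
}
```

For the elements {"ch", "c", "h"}, the input "chata" gives 2 and the input "cat" gives 1. Without the sort order, the loop would have to visit every element and keep track of the maximum.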

In addition, I also added code that marks repetitions surrounding some character classes as "not simple". The reason is that these character classes can potentially match character sequences of various lengths, so they can behave like an "if". The "simple loop optimization" doesn't work when the repeated subexpression can branch in the sense that it can match the same input in observably different ways (e.g., substrings of different lengths can be matched or capture groups are located differently). For this reason, repetitions surrounding such character classes should not be marked "simple", and I think this PR is the best opportunity to make this change since it repairs the matching of multi-character collating elements in character classes.
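The "branching" concern can be made concrete with a toy model of a character class. This is a sketch under assumptions, not the matcher's logic: `possible_match_lengths` is a hypothetical helper, and the class is modeled as a plain list of collating elements.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical toy model: returns every length that some element of the
// class could consume at the start of `input`, in the order of `elements`.
inline std::vector<std::size_t> possible_match_lengths(
    const std::vector<std::string>& elements, const std::string& input) {
    std::vector<std::size_t> lengths;
    for (const std::string& elem : elements) {
        if (!elem.empty() && input.compare(0, elem.size(), elem) == 0) {
            lengths.push_back(elem.size());
        }
    }
    return lengths;
}
```

For a class containing "ch" and "c", the input "chata" yields the lengths 2 and 1, while "cat" yields only 1. A repetition around such a class therefore cannot assume that each iteration consumes a fixed number of characters.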

I made a very conservative assessment of which character classes should cause a surrounding repetition to be marked as "non-simple"; I included any character classes that I could somehow reasonably imagine matching character sequences of various lengths today or in the future, so the set is certainly larger than necessary. I settled on such a conservative choice for two reasons. First, it's trivial to enable the simple loop optimization, but it's basically impossible to take it back later (unless the feature related to this change is utterly broken anyway, like collating symbols before this PR and regex_constants::collate before #5238, "<regex>: Implement collating ranges"). Second, we might change which character classes can potentially branch in the future. For example, let's assume that the collating element "ch" from the Czech alphabet is recognized. Currently, the pattern "[c[.ch.]]h" doesn't match "ch", but there is an argument to be had that it should. So all in all, if we mark too many repetitions surrounding character classes as "simple" now, it might become much more difficult or even impossible to change the semantics of character classes to a more appropriate choice later, because we have to maintain backwards compatibility.
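The "[c[.ch.]]h" example can be modeled with a short sketch that contrasts committing to the longest element with also trying shorter alternatives. Everything here is hypothetical: `class_then_char_matches` is an invented helper, the class is a list of strings sorted longest first, and "ch" is assumed to be a recognized collating element.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Toy model of the pattern "<class><tail>" matched against all of `input`.
// backtrack == false: commit to the first (longest) element that matches.
// backtrack == true:  also try shorter elements if the rest fails to match.
inline bool class_then_char_matches(const std::vector<std::string>& elements_longest_first, char tail,
    const std::string& input, bool backtrack) {
    for (const std::string& elem : elements_longest_first) {
        if (elem.empty() || input.compare(0, elem.size(), elem) != 0) {
            continue; // this element does not match at the start
        }

        // The rest of the input must be exactly the tail character.
        const bool rest_ok = input.size() == elem.size() + 1 && input[elem.size()] == tail;
        if (rest_ok) {
            return true;
        }

        if (!backtrack) {
            return false; // committed to the longest match, no retry
        }
    }
    return false;
}
```

With the class {"ch", "c"} and tail 'h', the input "ch" is rejected when `backtrack` is false (the class consumes "ch" and nothing is left for 'h') and accepted when it is true (the class consumes "c" instead). That second behavior is the possible future semantics that a premature "simple" marking could make hard to adopt.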

Finally, we have to choose which collating symbol names should and can be recognized by the regex implementation by default (see #5393). Here, I have provisionally made the minimal safe choice: only single characters are recognized by default, as every character is a collating element by definition.
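A quick way to probe what a given implementation recognizes is to call the standard `std::regex_traits<char>::lookup_collatename()` directly. The wrapper `collating_element_for` below is a hypothetical convenience function; only the single-character case is described by this PR, and results for longer names are implementation-specific.

```cpp
#include <regex>
#include <string>

// Hypothetical convenience wrapper around the standard traits member.
// Returns the collating element for `name`, or an empty string if the
// implementation does not recognize the name.
inline std::string collating_element_for(const std::string& name) {
    std::regex_traits<char> traits;
    return traits.lookup_collatename(name.begin(), name.end());
}
```

Under the provisional choice described above, a single-character name such as "a" is expected to come back as "a". Whether a longer name such as "ch" is recognized depends on the outcome of #5393 and on the standard library in use.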

stl/inc/regex

tests/std/tests/GH_005204_regex_collating_ranges/test.cpp

StephanTLavavej · 2025-04-18T07:07:50Z

Thanks! 😻 I pushed a source-conflict-free merge with main, fixes for nitpicks and simple issues, and resolved a stealth merge conflict.

I had a couple of outstanding questions (see unresolved comments above) but they don't block merging.

StephanTLavavej · 2025-04-18T11:49:19Z

Closing and reopening to wake up the snoozy CLA bot...

StephanTLavavej · 2025-04-22T10:13:28Z

I'm mirroring this to the MSVC-internal repo - please notify me if any further changes are pushed.

StephanTLavavej · 2025-04-22T20:59:51Z

Thanks again for fixing these long-standing bugs! 😻 🌍 🔡

<regex>: Properly parse and match collating symbols and equivalences

628f350

muellerj2 requested a review from a team as a code owner April 5, 2025 17:33

github-project-automation bot added this to STL Code Reviews Apr 5, 2025

github-project-automation bot moved this to Initial Review in STL Code Reviews Apr 5, 2025

Fix conversion warning on x86

83ed422

StephanTLavavej added the bug (Something isn't working) and regex (meow is a substring of homeowner) labels Apr 6, 2025

StephanTLavavej self-assigned this Apr 6, 2025

muellerj2 mentioned this pull request Apr 6, 2025

<regex>: What names can and should regex_traits::lookup_collatename() recognize? #5393

Open

StephanTLavavej added 10 commits April 17, 2025 22:20

Merge branch 'main' into regex-repair-collating-symbol-support

f016eb0

Reuse _Str_first.

a4e4c13

Split error_collate and error_space checks.

9c43af9

Drop newlines.

ef446e2

Include <algorithm> for equal().

ba071de

Drop unnecessary qualification.

a15c1e4

If r.assign() fails, return;.

e92221c

Drop repeated test line.

730218e

Fix SKIP_COLLATE_TESTS usage.

71eba44

Fix stealth merge conflict in GH_005244_regex_escape_sequences.

13d0a87

StephanTLavavej reviewed Apr 18, 2025

StephanTLavavej approved these changes Apr 18, 2025

StephanTLavavej removed their assignment Apr 18, 2025

StephanTLavavej moved this from Initial Review to Ready To Merge in STL Code Reviews Apr 18, 2025

StephanTLavavej added 2 commits April 18, 2025 02:31

Defend lookup_collatename() against empty inputs.

4063407

Extend the SKIP_COLLATE_TESTS guard to cover should_throw() above.

876756a

StephanTLavavej approved these changes Apr 18, 2025

StephanTLavavej closed this Apr 18, 2025

github-project-automation bot moved this from Ready To Merge to Done in STL Code Reviews Apr 18, 2025

StephanTLavavej reopened this Apr 18, 2025

github-project-automation bot moved this from Done to Initial Review in STL Code Reviews Apr 18, 2025

StephanTLavavej moved this from Initial Review to Ready To Merge in STL Code Reviews Apr 18, 2025

StephanTLavavej moved this from Ready To Merge to Merging in STL Code Reviews Apr 22, 2025

StephanTLavavej merged commit 81056a9 into microsoft:main Apr 22, 2025
39 checks passed

github-project-automation bot moved this from Merging to Done in STL Code Reviews Apr 22, 2025

This was referenced Apr 22, 2025

<regex>: Make wregex correctly match negated character classes #5403

Merged

<regex>: Equivalence classes have unexpected behavior with std::wregex #5435

Closed

muellerj2 mentioned this pull request Apr 26, 2025

<regex>: regex_traits::transform_primary should yield primary sort keys appropriate for the imbued locale #5444

Merged

StephanTLavavej mentioned this pull request May 3, 2025

<regex>: Make wregex handle small character ranges containing U+00FF and U+0100 correctly #5437

Merged

muellerj2 deleted the regex-repair-collating-symbol-support branch May 31, 2025 21:27
