Skip to content

Conversation

muellerj2
Copy link
Contributor

@muellerj2 muellerj2 commented Apr 5, 2025

Fixes #994 and solves the single-character case of #5391. Also makes slight progress towards #438.

Collating symbols and equivalences are now parsed following the intended semantics in the standard as much as is possible under the current layout of NFA nodes:

  1. Pass the name of the collating symbol to lookup_collatename() in the traits class to obtain the collating element.
  2. Reject the expression with error_collate if lookup_collatename() returns an empty string (see [re.grammar]/8 and 10). Also reject it with error_space if the returned collating element can't be stored in the NFA node.
  3. If this is an expression starting with [., then we should process the collating element like a normal character. We can easily do so if the collating element is just a single character. However, ranges bounded by such collating elements can't be represented in the NFA currently. For this reason, multi-character collating elements are treated syntactically as a set and immediately added to the NFA by calling _Nfa._Add_coll2(). This means that these collating elements can be individual elements of a character class, but any attempts to use them as range bounds will be rejected by the parser with an error_range.
  4. If this is an expression starting with [=, this denotes an equivalence class. If so, add the collating element to the NFA by calling _Nfa._Add_equiv2(). This function calls transform_primary() and checks if we can obtain a primary sort key for this collating element. If not, it will reject the equivalence with an error_collate (see [re.grammar]/10).
    • I doubt that the current implementation of transform_primary() in regex_traits can even return an empty string. But it might do so in the future when LWG-4186 regex_traits::transform_primary mistakenly detects typeid of a function #5291 is addressed.
    • It would be preferable to store the primary sort key rather than the collating element in the NFA in order to avoid the (repeated) computation of the sort key during matching. Unfortunately, though, this choice wasn't made in the past, and while the parser didn't do the name lookup through lookup_collatename(), this feature worked otherwise. So this feature seems to have been usable enough that we shouldn't deliberately break people just for some performance gain.
    • In contrast, I did make such a change for collating elements: These are now passed through translate() or translate_nocase() before being stored, and the matcher assumes that this translation pass has happened. But the major difference is: This feature is utterly broken before this PR (as documented in <regex>: Regex erroneously returns a match #994). So I believe this assumption is fine, because even if the new matcher after this PR is picked up with the old parser before this PR, at worst there will just be some patterns for which some inputs are still mistakenly not matched; this is still much better than the utterly broken behavior that the old matcher produces.

In the matcher, the implementation of _Lookup_coll was completely revised as the old one was utterly broken. The new implementation now assumes that the collating elements are sorted descendingly in length. The parser has always sorted them in this way (as it took advantage of the sorting), but nothing in the matcher has made use of it until now.

  • I can also change the code to remove this assumption if this is your preference.

In addition, I also added code that marks repetitions surrounding some character classes as "not simple". The reason is that these character classes can potentially match character sequences of various lengths, so they can behave like an "if". The "simple loop optimization" doesn't work when the repeated subexpression can branch in the sense that it can match the same input in observably different ways (e.g., substrings of different lengths can be matched or capture groups are located differently). For this reason, repetitions surrounding such character classes should not be marked "simple", and I think this PR is the best opportunity to make this change since it repairs the matching of multi-character collating elements in character classes.

  • I made a very conservative assessment what character classes should cause a surrounding repetition to be marked as "non-simple"; I included any character classes that I could somehow reasonably imagine to match character sequences of various lengths today or in the future, but this certainly includes too many. I settled on such a conservative choice for two reasons: First, it's trivial to enable the simple loop optimization, but it's basically impossible to take it back later (except if the feature related to this change is utterly broken anyway, like collating symbols before this PR and regex_constants::collate before <regex>: Implement collating ranges #5238). Second, we might change which character classes can potentially branch in the future. For example, let's assume that the collating element "ch" from the Czech alphabet is recognized. Currently, the pattern "[c[.ch.]]h" doesn't match "ch", but there is an argument to be had that it should. So all in all, if we mark too many repetitions surrounding character classes as "simple" now, it might become much more difficult or even impossible to change the semantics of character classes to a more appropriate choice later because we have to maintain backwards compatibility.

Finally, we have to make choice what collating symbol names should and can be recognized by the regex implementation by default (see #5393). Here, I have provisionally made the minimal safe choice: Only single characters are recognized by default, as every character is a collating element by definition.

@muellerj2 muellerj2 requested a review from a team as a code owner April 5, 2025 17:33
@github-project-automation github-project-automation bot moved this to Initial Review in STL Code Reviews Apr 5, 2025
@StephanTLavavej StephanTLavavej added bug Something isn't working regex meow is a substring of homeowner labels Apr 6, 2025
@StephanTLavavej StephanTLavavej self-assigned this Apr 6, 2025
@StephanTLavavej
Copy link
Member

Thanks! 😻 I pushed a source-conflict-free merge with main, fixes for nitpicks and simple issues, and resolved a stealth merge conflict.

I had a couple of outstanding questions (see unresolved comments above) but they don't block merging.

@StephanTLavavej StephanTLavavej removed their assignment Apr 18, 2025
@StephanTLavavej StephanTLavavej moved this from Initial Review to Ready To Merge in STL Code Reviews Apr 18, 2025
@StephanTLavavej
Copy link
Member

Closing and reopening to wake up the snoozy CLA bot...

@github-project-automation github-project-automation bot moved this from Ready To Merge to Done in STL Code Reviews Apr 18, 2025
@github-project-automation github-project-automation bot moved this from Done to Initial Review in STL Code Reviews Apr 18, 2025
@StephanTLavavej StephanTLavavej moved this from Initial Review to Ready To Merge in STL Code Reviews Apr 18, 2025
@StephanTLavavej StephanTLavavej moved this from Ready To Merge to Merging in STL Code Reviews Apr 22, 2025
@StephanTLavavej
Copy link
Member

I'm mirroring this to the MSVC-internal repo - please notify me if any further changes are pushed.

@StephanTLavavej StephanTLavavej merged commit 81056a9 into microsoft:main Apr 22, 2025
39 checks passed
@github-project-automation github-project-automation bot moved this from Merging to Done in STL Code Reviews Apr 22, 2025
@StephanTLavavej
Copy link
Member

Thanks again for fixing these long-standing bugs! 😻 🌍 🔡

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working regex meow is a substring of homeowner
Projects
Archived in project
Development

Successfully merging this pull request may close these issues.

<regex>: Regex erroneously returns a match
2 participants