`<regex>`: Properly parse and match collating symbols and equivalences #5392
Thanks! 😻 I pushed a source-conflict-free merge with [...]. I had a couple of outstanding questions (see unresolved comments above) but they don't block merging.

Closing and reopening to wake up the snoozy CLA bot...

I'm mirroring this to the MSVC-internal repo - please notify me if any further changes are pushed.

Thanks again for fixing these long-standing bugs! 😻 🌍 🔡
Fixes #994 and solves the single-character case of #5391. Also makes slight progress towards #438.
Collating symbols and equivalences are now parsed following the intended semantics in the standard, as far as is possible under the current layout of NFA nodes:

- Call `lookup_collatename()` in the traits class to obtain the collating element.
- Reject the expression with `error_collate` if `lookup_collatename()` returns an empty string (see [re.grammar]/8 and 10). Also reject it with `error_space` if the returned collating element can't be stored in the NFA node.
- If the opening delimiter is `[.`, then we should process the collating element like a normal character. We can easily do so if the collating element is just a single character. However, ranges bounded by multi-character collating elements can't currently be represented in the NFA. For this reason, multi-character collating elements are treated syntactically as a set and immediately added to the NFA by calling `_Nfa._Add_coll2()`. This means that these collating elements can be individual elements of a character class, but any attempt to use them as range bounds will be rejected by the parser with an `error_range`.
- If the opening delimiter is `[=`, this denotes an equivalence class. In that case, add the collating element to the NFA by calling `_Nfa._Add_equiv2()`. This function calls `transform_primary()` and checks whether we can obtain a primary sort key for this collating element. If not, it rejects the equivalence with an `error_collate` (see [re.grammar]/10).
  - [...] `transform_primary()` in `regex_traits` can even return an empty string. But it might do so in the future when LWG-4186 `regex_traits::transform_primary` mistakenly detects `typeid` of a function #5291 is addressed.

[...] `lookup_collatename()`, this feature worked otherwise. So this feature seems to have been usable enough that we shouldn't deliberately break people just for some performance gain.

[...] `translate()` or `translate_nocase()` before being stored, and the matcher assumes that this translation pass has happened. But the major difference is: this feature was utterly broken before this PR (as documented in `<regex>`: Regex erroneously returns a match #994). So I believe this assumption is fine, because even if the new matcher after this PR is picked up with the old parser before this PR, at worst there will just be some patterns for which some inputs are still mistakenly not matched; this is still much better than the utterly broken behavior that the old matcher produces.

In the matcher, the implementation of
`_Lookup_coll` was completely revised as the old one was utterly broken. The new implementation now assumes that the collating elements are sorted in descending order of length. The parser has always sorted them in this way (as it took advantage of the sorting), but nothing in the matcher has made use of it until now.

In addition, I also added code that marks repetitions surrounding some character classes as "not simple". The reason is that these character classes can potentially match character sequences of various lengths, so they can behave like an "if". The "simple loop optimization" doesn't work when the repeated subexpression can branch in the sense that it can match the same input in observably different ways (e.g., substrings of different lengths can be matched or capture groups are located differently). For this reason, repetitions surrounding such character classes should not be marked "simple", and I think this PR is the best opportunity to make this change since it repairs the matching of multi-character collating elements in character classes.
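To illustrate why the descending-length ordering matters, here is a minimal sketch of a longest-first lookup. It is not the STL's `_Lookup_coll`; the function name `match_collating_element` and the use of `std::vector<std::string>` are assumptions made purely for illustration.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical illustration, not the STL's _Lookup_coll.
// Returns the length of the first element that is a prefix of `input`,
// or 0 if no element matches.
// Precondition: `elements` is sorted by descending length, so the first
// hit is also the longest possible match.
std::size_t match_collating_element(const std::vector<std::string>& elements,
                                    const std::string& input) {
    assert(std::is_sorted(elements.begin(), elements.end(),
        [](const std::string& a, const std::string& b) {
            return a.size() > b.size();
        }));

    for (const std::string& elem : elements) {
        // compare() clamps the count to the remaining input, so a short
        // input simply fails to match a longer element.
        if (input.compare(0, elem.size(), elem) == 0) {
            return elem.size();
        }
    }
    return 0;
}
```

With the elements `{"ch", "c"}`, this sketch consumes two characters of `"chata"` but only one of `"cat"`. The same class matching inputs of different lengths is the kind of branching that makes a surrounding repetition unsuitable for the "simple loop" treatment.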
[...] `regex_constants::collate` before `<regex>`: Implement collating ranges #5238). Second, we might change which character classes can potentially branch in the future. For example, let's assume that the collating element "ch" from the Czech alphabet is recognized. Currently, the pattern "[c[.ch.]]h" doesn't match "ch", but there is an argument to be had that it should. So all in all, if we mark too many repetitions surrounding character classes as "simple" now, it might become much more difficult or even impossible to change the semantics of character classes to a more appropriate choice later, because we have to maintain backwards compatibility.

Finally, we have to choose which collating symbol names should and can be recognized by the regex implementation by default (see #5393). Here, I have provisionally made the minimal safe choice: only single characters are recognized by default, as every character is a collating element by definition.
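The single-character case can be exercised through the public `<regex>` interface. The following is a sketch, not a test from this PR: the helpers `compiles`, `fails_with`, and `lookup` are hypothetical names introduced here, and how a multi-character name such as `no-such-element` is treated depends on the standard library and its version.

```cpp
#include <cassert>
#include <regex>
#include <string>

// Hypothetical helper: true if `pattern` can be compiled by std::regex.
bool compiles(const std::string& pattern) {
    try {
        std::regex re(pattern);
        (void) re;
        return true;
    } catch (const std::regex_error&) {
        return false;
    }
}

// Hypothetical helper: true if compiling `pattern` throws a regex_error
// carrying exactly the error code `code`.
bool fails_with(const std::string& pattern, std::regex_constants::error_type code) {
    try {
        std::regex re(pattern);
        (void) re;
        return false;
    } catch (const std::regex_error& e) {
        return e.code() == code;
    }
}

// Hypothetical helper: asks the default traits class to resolve a name.
std::string lookup(const std::string& name) {
    std::regex_traits<char> traits;
    return traits.lookup_collatename(name.begin(), name.end());
}

// Single-character cases, which the description says are recognized by default.
void check_single_character_cases() {
    assert(lookup("a") == "a");                            // a character names itself
    assert(std::regex_match("a", std::regex("[[.a.]]")));  // collating symbol in a class
    assert(std::regex_match("a", std::regex("[[=a=]]")));  // equivalence class of 'a'
}
```

Calling `fails_with("[[.no-such-element.]]", std::regex_constants::error_collate)` shows whether a given implementation rejects an unrecognized name in the way the parsing rules above describe.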