Skip to content

Conversation

muellerj2
Copy link
Contributor

@muellerj2 muellerj2 commented Dec 5, 2024

This PR deals with three related problems for character ranges in case-insensitive mode:

  • _Builder::_Add_range() casts the character bounds to unsigned int. As a consequence, characters with negative numeric values are not added to the bitmap, but rather to the _Large list of characters. This means that these characters are not found during matching. (Note the suspiciously different casts in the else branch for the case-sensitive case.)
  • The parser fails to reject some empty ranges in case-insensitive mode such as [Z-a] (= [z-a]).
  • When both the collate and icase flags are set, there is an unnecessary call to translate the bounds by _Traits.translate() first before passing them to _Traits.translate_nocase(). The standard says in [re.grammar]/14.1 and 14.2 that it is sufficient to call translate_nocase() only. (See also _Builder::_Add_char(), which already follows the Standard in this regard.)

The PR moves the entire character translation into the parser so that empty ranges can be reliably diagnosed there in case-insensitive mode as well. It also fixes the unsigned cast and removes the unnecessary translate() call.

The test deliberately does not use any manual signed/unsigned casts, but leaves all of these casts to char_traits to avoid getting the casts similarly wrong in <regex> and the test.

@muellerj2 muellerj2 requested a review from a team as a code owner December 5, 2024 12:40
Comment on lines -4121 to +4120
if (static_cast<typename _RxTraits::_Uelem>(_Val) < static_cast<typename _RxTraits::_Uelem>(_Chr1)) {
if (static_cast<typename _RxTraits::_Uelem>(_Chr2) < static_cast<typename _RxTraits::_Uelem>(_Chr1)) {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note that there is still a (pre-existing) bug in this error check when the collate flag is set: The bounds should be transformed by _Traits.transform() before performing the comparison.

However, the matcher fails to do this transform() dance as well, so I think this should rather be fixed for both matcher and parser in a coordinated fashion in a follow-up PR. (But I will adjust the PR if you rather prefer to fix this check completely now.)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A follow-up PR will be great.

@CaseyCarter CaseyCarter added the bug Something isn't working label Dec 5, 2024
@StephanTLavavej StephanTLavavej self-assigned this Dec 5, 2024
@StephanTLavavej StephanTLavavej added the regex meow is a substring of homeowner label Jan 8, 2025
Comment on lines -4121 to +4120
if (static_cast<typename _RxTraits::_Uelem>(_Val) < static_cast<typename _RxTraits::_Uelem>(_Chr1)) {
if (static_cast<typename _RxTraits::_Uelem>(_Chr2) < static_cast<typename _RxTraits::_Uelem>(_Chr1)) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A follow-up PR will be great.

@StephanTLavavej
Copy link
Member

Thanks - I really appreciate the detailed explanations in your PRs! 🐱

@StephanTLavavej StephanTLavavej removed their assignment Jan 12, 2025
@StephanTLavavej StephanTLavavej self-assigned this Jan 13, 2025
@StephanTLavavej
Copy link
Member

I'm mirroring this to the MSVC-internal repo - please notify me if any further changes are pushed.

@StephanTLavavej StephanTLavavej merged commit af0bd00 into microsoft:main Jan 14, 2025
39 checks passed
@StephanTLavavej
Copy link
Member

Thanks! 😻+

@muellerj2 muellerj2 deleted the fix-case-insensitive-char-ranges branch January 14, 2025 20:21
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working regex meow is a substring of homeowner
Projects
Archived in project
Development

Successfully merging this pull request may close these issues.

3 participants