stephentoub
Member

Sometimes you see patterns where folks have put the same anchor multiple times in a row, e.g. `\b\b`. The subsequent anchors are nops and can just be removed.
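
As a quick sanity check of the claim that repeated anchors are no-ops, here is a minimal sketch that compares match positions for a pattern with doubled anchors against the same pattern with single anchors. The helper names `MatchIndexes` and `SameMatches` and the sample inputs are made up for illustration and are not part of this PR.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

// Collect the start index of every match so two patterns can be compared.
int[] MatchIndexes(string pattern, string input) =>
    Regex.Matches(input, pattern).Select(m => m.Index).ToArray();

// True when both patterns match at exactly the same positions in the input.
bool SameMatches(string repeated, string single, string input) =>
    MatchIndexes(repeated, input).SequenceEqual(MatchIndexes(single, input));

// Hypothetical sample inputs; anchors are zero-width, so repeating one
// simply repeats the same position test.
string[] inputs = { "cat", "a cat sat", "concat", "" };
foreach (string input in inputs)
{
    bool wordBoundary = SameMatches(@"\b\bcat\b\b", @"\bcat\b", input);
    bool startEnd = SameMatches(@"^^cat$$", @"^cat$", input);
    Console.WriteLine($"'{input}': word boundary same = {wordBoundary}, start/end same = {startEnd}");
}
```

This only shows behavioral equivalence on a few inputs; it says nothing about how the reduction itself is implemented.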
@stephentoub stephentoub requested review from MihaZupan and Copilot July 26, 2025 10:59
@stephentoub
Member Author

@MihuBot regexdiff

Contributor

@Copilot Copilot AI left a comment

Pull Request Overview

This PR implements an optimization for regex patterns by coalescing adjacent equivalent anchors (e.g., `\b\b` becomes `\b`). This improves regex compilation performance by removing redundant anchors that don't affect matching behavior.

Key changes:

  • Adds logic to detect and remove consecutive identical anchor patterns during regex node reduction
  • Adds comprehensive test coverage for various anchor types and combinations
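
The first bullet can be pictured with a deliberately simplified model. The sketch below treats a pattern as a flat list of string tokens and drops an anchor token when it repeats the one immediately before it. This is an assumption-laden illustration only: the token list, the `anchors` set, and `CoalesceAnchors` are hypothetical and do not reflect the actual `RegexNode.cs` code.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical token spellings for the zero-width anchors in this toy model.
var anchors = new HashSet<string> { "^", "$", @"\A", @"\Z", @"\z", @"\G", @"\b", @"\B" };

// Drop an anchor when it is identical to the token emitted just before it.
// Non-anchor tokens (e.g. a literal "a") are never merged.
List<string> CoalesceAnchors(IEnumerable<string> nodes)
{
    var result = new List<string>();
    foreach (string node in nodes)
    {
        bool repeatsPreviousAnchor =
            result.Count > 0 && anchors.Contains(node) && result[result.Count - 1] == node;
        if (!repeatsPreviousAnchor)
        {
            result.Add(node);
        }
    }
    return result;
}

string[] tokens = { "^", "^", "a", "a", @"\b", @"\b", "$", "$" };
Console.WriteLine(string.Concat(CoalesceAnchors(tokens))); // prints ^aa\b$
```

Note that only identical adjacent anchors are merged in this model; a different pair such as `\b\B` is left alone.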

Reviewed Changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 3 comments.

| File | Description |
| --- | --- |
| RegexNode.cs | Implements anchor coalescing logic in the regex reduction algorithm |
| RegexReductionTests.cs | Adds test cases for anchor coalescing scenarios and edge cases |

Comments suppressed due to low confidence (1)

src/libraries/System.Text.RegularExpressions/tests/UnitTests/RegexReductionTests.cs:384

  • This test case doesn't demonstrate anchor coalescing since it only contains a single $ anchor. Consider using [InlineData(@"$$", @"$")] to test coalescing of multiple end-of-string anchors.
        [InlineData(@"$", @"$")]

@MihuBot

MihuBot commented Jul 26, 2025

79 out of 18857 patterns have generated source code changes.

Examples of GeneratedRegex source diffs
"(?<desc>h|ampm|am\\b|a\\.m\\.|a m\\b|a\\. m\ ..." (114 uses)
[GeneratedRegex("(?<desc>h|ampm|am\\b|a\\.m\\.|a m\\b|a\\. m\\.|a\\.m\\b|a\\. m\\b|pm\\b|p\\.m\\.|p m\\b|p\\. m\\.|p\\.m\\b|p\\. m\\b|p\\b\\b)", RegexOptions.IgnoreCase | RegexOptions.Singleline)]
  ///         ○ Match a sequence of expressions.<br/>
  ///             ○ Match a character in the set [Pp].<br/>
  ///             ○ Match if at a word boundary.<br/>
-   ///             ○ Match if at a word boundary.<br/>
  /// </code>
  /// </remarks>
  [global::System.CodeDom.Compiler.GeneratedCodeAttribute("System.Text.RegularExpressions.Generator", "42.42.42.42")]
                                  return false; // The input didn't match.
                              }
                              
-                               // Match if at a word boundary.
-                               if (!Utilities.IsBoundary(inputSpan, pos + 1))
-                               {
-                                   UncaptureUntil(0);
-                                   return false; // The input didn't match.
-                               }
-                               
                              pos++;
                              slice = inputSpan.Slice(pos);
                          }
"^^(?<AmsNetId>((?<First>\\d{1,3})\\.(?<Secon ..." (50 uses)
[GeneratedRegex("^^(?<AmsNetId>((?<First>\\d{1,3})\\.(?<Second>\\d{1,3})\\.(?<Third>\\d{1,3})\\.(?<Fourth>\\d{1,3})\\.(?<Fifth>\\d{1,3})\\.(?<Sixth>\\d{1,3})) | Local | Empty | LocalHost)(:(?<AdsPort>\\d+))?$$", RegexOptions.IgnoreCase | RegexOptions.IgnorePatternWhitespace | RegexOptions.CultureInvariant)]
  /// Explanation:<br/>
  /// <code>
  /// ○ Match if at the beginning of the string.<br/>
-   /// ○ Match if at the beginning of the string.<br/>
  /// ○ "AmsNetId" capture group.<br/>
  ///     ○ Match with 4 alternative expressions.<br/>
  ///         ○ 1st capture group.<br/>
  ///         ○ "AdsPort" capture group.<br/>
  ///             ○ Match a Unicode digit atomically at least once.<br/>
  /// ○ Match if at the end of the string or if before an ending newline.<br/>
-   /// ○ Match if at the end of the string or if before an ending newline.<br/>
  /// </code>
  /// </remarks>
  [global::System.CodeDom.Compiler.GeneratedCodeAttribute("System.Text.RegularExpressions.Generator", "42.42.42.42")]
                      return false; // The input didn't match.
                  }
                  
-                   // Match if at the beginning of the string.
-                   if (pos != 0)
-                   {
-                       UncaptureUntil(0);
-                       return false; // The input didn't match.
-                   }
-                   
                  // "AmsNetId" capture group.
                  //{
                      capture_starting_pos = pos;
                      goto LoopIterationNoMatch;
                  }
                  
-                   // Match if at the end of the string or if before an ending newline.
-                   if (pos < inputSpan.Length - 1 || ((uint)pos < (uint)inputSpan.Length && inputSpan[pos] != '\n'))
-                   {
-                       goto LoopIterationNoMatch;
-                   }
-                   
                  // The input matched.
                  base.runtextpos = pos;
                  base.Capture(0, matchStart, pos);
"\\s+([`~!@#$%^&*\\(\\)=+\\|\\[\\]{};':,.<>?< ..." (26 uses)
[GeneratedRegex("\\s+([`~!@#$%^&*\\(\\)=+\\|\\[\\]{};':,.<>?<《》,。?;‘’“”:、¥!…()])+\\s+")]
             }
         }
         
-        /// <summary>Supports searching for characters in or not in "\t\n\v\f\r \u0085              \u2028\u2029  ".</summary>
-        internal static readonly SearchValues<char> s_whitespace = SearchValues.Create("\t\n\v\f\r \u0085              \u2028\u2029  ");
+        /// <summary>Supports searching for characters in or not in "\t\n\v\f\r \u0085             \u2028\u2029   ".</summary>
+        internal static readonly SearchValues<char> s_whitespace = SearchValues.Create("\t\n\v\f\r \u0085             \u2028\u2029   ");
     }
 }
"\\s+([`~!@#$%^&*\\(\\)\\-_=+\\\\|\\[\\]{};': ..." (26 uses)
[GeneratedRegex("\\s+([`~!@#$%^&*\\(\\)\\-_=+\\\\|\\[\\]{};':,.<>/?<《》,。?;‘’“”:、—¥!…—()])+\\s+")]
             }
         }
         
-        /// <summary>Supports searching for characters in or not in "\t\n\v\f\r \u0085              \u2028\u2029  ".</summary>
-        internal static readonly SearchValues<char> s_whitespace = SearchValues.Create("\t\n\v\f\r \u0085              \u2028\u2029  ");
+        /// <summary>Supports searching for characters in or not in "\t\n\v\f\r \u0085             \u2028\u2029   ".</summary>
+        internal static readonly SearchValues<char> s_whitespace = SearchValues.Create("\t\n\v\f\r \u0085             \u2028\u2029   ");
     }
 }
"^\\s+(?<guid>\\{[0-9a-zA-Z]{8}-[0-9a-zA-Z]{4 ..." (26 uses)
[GeneratedRegex("^\\s+(?<guid>\\{[0-9a-zA-Z]{8}-[0-9a-zA-Z]{4}-[0-9a-zA-Z]{4}-[0-9a-zA-Z]{4}-[0-9a-zA-Z]{12}\\})\\s+=\\s+(?<dep>\\{[0-9a-zA-Z]{8}-[0-9a-zA-Z]{4}-[0-9a-zA-Z]{4}-[0-9a-zA-Z]{4}-[0-9a-zA-Z]{12}\\})", RegexOptions.Multiline)]
         /// <summary>Whether <see cref="s_defaultTimeout"/> is non-infinite.</summary>
         internal static readonly bool s_hasTimeout = s_defaultTimeout != Regex.InfiniteMatchTimeout;
         
-        /// <summary>Supports searching for characters in or not in "\t\n\v\f\r \u0085             \u2028\u2029   ".</summary>
-        internal static readonly SearchValues<char> s_whitespace = SearchValues.Create("\t\n\v\f\r \u0085             \u2028\u2029   ");
+        /// <summary>Supports searching for characters in or not in "\t\n\v\f\r \u0085             \u2028\u2029   ".</summary>
+        internal static readonly SearchValues<char> s_whitespace = SearchValues.Create("\t\n\v\f\r \u0085             \u2028\u2029   ");
     }
 }
"(?<id>\\S+)" (21 uses)
[GeneratedRegex("(?<id>\\S+)")]
         /// <summary>Whether <see cref="s_defaultTimeout"/> is non-infinite.</summary>
         internal static readonly bool s_hasTimeout = s_defaultTimeout != Regex.InfiniteMatchTimeout;
         
-        /// <summary>Supports searching for characters in or not in "\t\n\v\f\r \u0085             \u2028\u2029   ".</summary>
-        internal static readonly SearchValues<char> s_whitespace = SearchValues.Create("\t\n\v\f\r \u0085             \u2028\u2029   ");
+        /// <summary>Supports searching for characters in or not in "\t\n\v\f\r \u0085             \u2028\u2029   ".</summary>
+        internal static readonly SearchValues<char> s_whitespace = SearchValues.Create("\t\n\v\f\r \u0085             \u2028\u2029   ");
     }
 }
"(?<id>\\S+)(?<version>.*)" (21 uses)
[GeneratedRegex("(?<id>\\S+)(?<version>.*)")]
         /// <summary>Whether <see cref="s_defaultTimeout"/> is non-infinite.</summary>
         internal static readonly bool s_hasTimeout = s_defaultTimeout != Regex.InfiniteMatchTimeout;
         
-        /// <summary>Supports searching for characters in or not in "\t\n\v\f\r \u0085             \u2028\u2029   ".</summary>
-        internal static readonly SearchValues<char> s_whitespace = SearchValues.Create("\t\n\v\f\r \u0085             \u2028\u2029   ");
+        /// <summary>Supports searching for characters in or not in "\t\n\v\f\r \u0085             \u2028\u2029   ".</summary>
+        internal static readonly SearchValues<char> s_whitespace = SearchValues.Create("\t\n\v\f\r \u0085             \u2028\u2029   ");
     }
 }
"\\s+" (18 uses)
[GeneratedRegex("\\s+", RegexOptions.IgnoreCase | RegexOptions.CultureInvariant)]
         /// <summary>Whether <see cref="s_defaultTimeout"/> is non-infinite.</summary>
         internal static readonly bool s_hasTimeout = s_defaultTimeout != Regex.InfiniteMatchTimeout;
         
-        /// <summary>Supports searching for characters in or not in "\t\n\v\f\r \u0085             \u2028\u2029   ".</summary>
-        internal static readonly SearchValues<char> s_whitespace = SearchValues.Create("\t\n\v\f\r \u0085             \u2028\u2029   ");
+        /// <summary>Supports searching for characters in or not in "\t\n\v\f\r \u0085             \u2028\u2029   ".</summary>
+        internal static readonly SearchValues<char> s_whitespace = SearchValues.Create("\t\n\v\f\r \u0085             \u2028\u2029   ");
     }
 }
"\\s\\s+" (17 uses)
[GeneratedRegex("\\s\\s+")]
         /// <summary>Whether <see cref="s_defaultTimeout"/> is non-infinite.</summary>
         internal static readonly bool s_hasTimeout = s_defaultTimeout != Regex.InfiniteMatchTimeout;
         
-        /// <summary>Supports searching for characters in or not in "\t\n\v\f\r \u0085             \u2028\u2029   ".</summary>
-        internal static readonly SearchValues<char> s_whitespace = SearchValues.Create("\t\n\v\f\r \u0085             \u2028\u2029   ");
+        /// <summary>Supports searching for characters in or not in "\t\n\v\f\r \u0085             \u2028\u2029   ".</summary>
+        internal static readonly SearchValues<char> s_whitespace = SearchValues.Create("\t\n\v\f\r \u0085             \u2028\u2029   ");
     }
 }
"(?<desc>h|ampm|am\\b|a\\.m\\.|a m\\b|a\\. m\ ..." (16 uses)
[GeneratedRegex("(?<desc>h|ampm|am\\b|a\\.m\\.|a m\\b|a\\. m\\.|a\\.m\\b|a\\. m\\b|pm\\b|p\\.m\\.|p m\\b|p\\. m\\.|p\\.m\\b|p\\. m\\b|p\\b\\b)", RegexOptions.Singleline)]
  ///                 ○ Match a sequence of expressions.<br/>
  ///                     ○ Match the string " m".<br/>
  ///                     ○ Match if at a word boundary.<br/>
-   ///                 ○ Match a sequence of expressions.<br/>
-   ///                     ○ Match if at a word boundary.<br/>
-   ///                     ○ Match if at a word boundary.<br/>
+   ///                 ○ Match if at a word boundary.<br/>
  /// </code>
  /// </remarks>
  [global::System.CodeDom.Compiler.GeneratedCodeAttribute("System.Text.RegularExpressions.Generator", "42.42.42.42")]
                                              return false; // The input didn't match.
                                          }
                                          
-                                           // Match if at a word boundary.
-                                           if (!Utilities.IsBoundary(inputSpan, pos + 1))
-                                           {
-                                               UncaptureUntil(0);
-                                               return false; // The input didn't match.
-                                           }
-                                           
                                          pos++;
                                          slice = inputSpan.Slice(pos);
                                      }

For more diff examples, see https://gist.github.com/MihuBot/2273def877179c54bafb17a692e2f31d

Total bytes of base: 54274946
Total bytes of diff: 54274415
Total bytes of delta: -531 (-0.00 % of base)
Total relative delta: -0.19
    diff is an improvement.
    relative diff is an improvement.

For a list of JIT diff improvements, see Improvements.md

Sample source code for further analysis
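using System;
using System.IO;
using System.IO.Compression;
using System.Linq;
using System.Net.Http;
using System.Text.Json;
using System.Text.RegularExpressions;

const string JsonPath = "RegexResults-1284.json";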
if (!File.Exists(JsonPath))
{
    await using var archiveStream = await new HttpClient().GetStreamAsync("https://mihubot.xyz/r/E2jppW7A");
    using var archive = new ZipArchive(archiveStream, ZipArchiveMode.Read);
    archive.Entries.First(e => e.Name == "Results.json").ExtractToFile(JsonPath);
}

using FileStream jsonFileStream = File.OpenRead(JsonPath);
RegexEntry[] entries = JsonSerializer.Deserialize<RegexEntry[]>(jsonFileStream, new JsonSerializerOptions { IncludeFields = true })!;
Console.WriteLine($"Working with {entries.Length} patterns");



record KnownPattern(string Pattern, RegexOptions Options, int Count);

sealed class RegexEntry
{
    public required KnownPattern Regex { get; set; }
    public required string MainSource { get; set; }
    public required string PrSource { get; set; }
    public string? FullDiff { get; set; }
    public string? ShortDiff { get; set; }
    public (string Name, string Values)[]? SearchValuesOfChar { get; set; }
    public (string[] Values, StringComparison ComparisonType)[]? SearchValuesOfString { get; set; }
}

@stephentoub
Member Author

@MihaZupan, in your bot's results, what's changing in some of these patterns, e.g. "\s\s+"?

@MihaZupan
Member

Looks like one of the spaces is changing for some reason.
It's probably in my tooling somewhere, not sure why it's non-deterministic.

internal static readonly SearchValues<char> s_whitespace = SearchValues.Create("\t\n\v\f\r \u0085             \u2028\u2029   ");
115: 2008 2008
116: 2009 2009
117: 2009 200A <---
118: 005C 005C
119: 0075 0075
120: 0032 0032
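
A side-by-side dump like the one above can be produced with a few lines of code. The following is a rough sketch, assuming two strings compared one UTF-16 code unit at a time; `DiffCodeUnits` and the sample strings are hypothetical and are not the tooling that was actually used here.

```csharp
using System;
using System.Collections.Generic;

// Produce one line per index: "<index>: <hex of before> <hex of after>",
// with an arrow appended where the two code units differ.
List<string> DiffCodeUnits(string before, string after)
{
    var lines = new List<string>();
    int length = Math.Max(before.Length, after.Length);
    for (int i = 0; i < length; i++)
    {
        // "----" marks a position past the end of the shorter string.
        string left = i < before.Length ? ((int)before[i]).ToString("X4") : "----";
        string right = i < after.Length ? ((int)after[i]).ToString("X4") : "----";
        string marker = left == right ? "" : " <---";
        lines.Add($"{i}: {left} {right}{marker}");
    }
    return lines;
}

// Hypothetical inputs: visually identical runs of whitespace that differ in one character.
foreach (string line in DiffCodeUnits("\u2008\u2009\u2009", "\u2008\u2009\u200A"))
{
    Console.WriteLine(line);
}
```

Printing code units in hex is what makes this kind of difference visible at all, since the characters involved all render as blank space.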

@stephentoub
Copy link
Member Author

> It's probably in my tooling somewhere, not sure why it's non-deterministic.

Hopefully. I'm a little worried your tooling is correct and there's some sort of issue in the generator.

@stephentoub stephentoub merged commit 197f38a into dotnet:main Jul 26, 2025
86 of 88 checks passed
@stephentoub stephentoub deleted the coalesceanchors branch July 26, 2025 17:37
@MihaZupan
Member

MihaZupan commented Jul 26, 2025

Something weird is going on here.

I've added test logic into the generator: MihaZupan@9389bed
and it's (sometimes) failing here:
https://github.com/MihaZupan/runtime/blob/9389bede36ae8a7c54e9dde5674870437f5eddc3/src/libraries/System.Text.RegularExpressions/gen/RegexGenerator.Emitter.cs#L507-L510

With chars:

9, 10, 11, 12, 13, 32, 133, 160, 5760, 8192, 8193, 8194, 8287, 8196, 8197, 8198, 8199, 8200, 8201, 8202, 8232, 8233, 8239, 8287, 12288

Note the 8194, 8287, 8196 instead of 8194, 8195, 8196, even though we're calling `Array.Sort(chars)` right before that.

@MihaZupan
Member

Oh actually it makes sense, we're using the same char[] instance here


so with multithreaded access the Sort is corrupting the array.
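
To illustrate the hazard in general terms: `Array.Sort` rearranges the array in place, so two threads sorting the same instance can interleave their writes and leave duplicated or missing elements. One common way to avoid that is to sort a private copy. The sketch below assumes that approach; `SortedKey` and the `shared` array are hypothetical names, and this is not necessarily the fix that was applied to the generator.

```csharp
using System;
using System.Linq;
using System.Threading.Tasks;

// Hypothetical stand-in for a char[] instance that several threads can reach at once.
char[] shared = { '\u2003', '\t', ' ', '\u2002', '\n' };
string original = new string(shared);

// Sort a private copy instead of the shared instance, so concurrent
// callers never write to the same memory.
string SortedKey(char[] chars)
{
    char[] copy = (char[])chars.Clone();
    Array.Sort(copy);
    return new string(copy);
}

// Many concurrent callers all read the same shared array.
string[] keys = new string[64];
Parallel.For(0, keys.Length, i => keys[i] = SortedKey(shared));

Console.WriteLine($"distinct results: {keys.Distinct().Count()}");
Console.WriteLine($"shared array unchanged: {new string(shared) == original}");
```

The copy costs an allocation per call; whether that is acceptable, or whether sorting once up front is preferable, depends on the call site.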

@stephentoub
Member Author

stephentoub commented Jul 26, 2025

> Oh actually it makes sense, we're using the same char[] instance here
>
> so with multithreaded access the Sort is corrupting the array.

OK, so a real bug. Nice job tracking it down.
Mind submitting a fix?

@github-actions github-actions bot locked and limited conversation to collaborators Aug 26, 2025