Skip to content

Conversation

stephentoub
Copy link
Member

I was overthinking things before, and the handling for positive lookaheads can be simplified and expanded. The logic was saying that a loop followed by an positive lookahead could be made atomic if it was disjoint not only from the lookahead's contents but also from what followed the lookahead. However, the latter isn't actually necessary. If the loop is disjoint with the lookahead, then it's also disjoint with the intersection of the lookahead and what follows / overlaps with the lookahead.

I was overthinking things before, and the handling for positive lookaheads can be simplified and expanded. The logic was saying that a loop followed by an positive lookahead could be made atomic if it was disjoint not only from the lookahead's contents but also from what followed the lookahead. However, the latter isn't actually necessary. If the loop is disjoint with the lookahead, then it's also disjoint with the intersection of the lookahead and what follows / overlaps with the lookahead.
@stephentoub stephentoub requested review from MihaZupan and Copilot July 30, 2025 22:19
@stephentoub
Copy link
Member Author

@MihuBot regexdiff

Copy link
Contributor

@Copilot Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull Request Overview

This PR simplifies and improves the logic for positive lookahead handling in the CanBeMadeAtomic method within the regex reduction optimization system. The change eliminates unnecessary complexity by removing the requirement that a loop must be disjoint from both the lookahead contents AND what follows the lookahead.

Key changes:

  • Simplified positive lookahead handling logic in CanBeMadeAtomic
  • Added support for skipping through positive lookaheads during atomic optimization analysis
  • Expanded test coverage with additional regex reduction test cases

Reviewed Changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 1 comment.

File Description
RegexNode.cs Simplified positive lookahead handling by removing redundant disjointness check and adding logic to skip through positive lookaheads
RegexReductionTests.cs Added new test cases covering positive lookahead scenarios and corrected an existing test expectation

Copy link
Contributor

Tagging subscribers to this area: @dotnet/area-system-text-regularexpressions
See info in area-owners.md if you want to be subscribed.

@MihuBot
Copy link

MihuBot commented Jul 30, 2025

7 out of 18857 patterns have generated source code changes.

Examples of GeneratedRegex source diffs
"\\G\\{%(\\s*(?<attrname>\\w+(?=\\W))(\\s*(?< ..." (1751 uses)
[GeneratedRegex("\\G\\{%(\\s*(?<attrname>\\w+(?=\\W))(\\s*(?<equal>=)\\s*'(?<attrval>[^']*)'|\\s*(?<equal>=)\\s*(?<attrval>[^\\s%>]*)|(?<equal>)(?<attrval>\\s*?)))*\\s*?%\\}")]
  ///     ○ 1st capture group.<br/>
  ///         ○ Match a whitespace character atomically any number of times.<br/>
  ///         ○ "attrname" capture group.<br/>
-   ///             ○ Match a word character greedily at least once.<br/>
+   ///             ○ Match a word character atomically at least once.<br/>
  ///             ○ Zero-width positive lookahead.<br/>
  ///                 ○ Match any character other than a word character.<br/>
  ///         ○ 2nd capture group.<br/>
                  char ch;
                  int charloop_starting_pos = 0, charloop_ending_pos = 0;
                  int charloop_starting_pos1 = 0, charloop_ending_pos1 = 0;
-                   int charloop_starting_pos2 = 0, charloop_ending_pos2 = 0;
                  int lazyloop_capturepos = 0;
                  int lazyloop_pos = 0;
                  int loop_iteration = 0;
                          }
                          
                          // "attrname" capture group.
-                           //{
+                           {
                              int capture_starting_pos1 = pos;
                              
-                               // Match a word character greedily at least once.
-                               //{
-                                   charloop_starting_pos = pos;
-                                   
+                               // Match a word character atomically at least once.
+                               {
                                  int iteration1 = 0;
                                  while ((uint)iteration1 < (uint)slice.Length && Utilities.IsWordChar(slice[iteration1]))
                                  {
                                  
                                  slice = slice.Slice(iteration1);
                                  pos += iteration1;
-                                   
-                                   charloop_ending_pos = pos;
-                                   charloop_starting_pos++;
-                                   goto CharLoopEnd;
-                                   
-                                   CharLoopBacktrack:
-                                   UncaptureUntil(base.runstack![--stackpos]);
-                                   Utilities.StackPop(base.runstack!, ref stackpos, out charloop_ending_pos, out charloop_starting_pos);
-                                   
-                                   if (Utilities.s_hasTimeout)
-                                   {
-                                       base.CheckTimeout();
-                                   }
-                                   
-                                   if (charloop_starting_pos >= charloop_ending_pos)
-                                   {
-                                       goto LoopIterationNoMatch;
-                                   }
-                                   pos = --charloop_ending_pos;
-                                   slice = inputSpan.Slice(pos);
-                                   
-                                   CharLoopEnd:
-                                   Utilities.StackPush(ref base.runstack!, ref stackpos, charloop_starting_pos, charloop_ending_pos, base.Crawlpos());
-                               //}
+                               }
                              
                              // Zero-width positive lookahead.
                              {
                                  // Match any character other than a word character.
                                  if (slice.IsEmpty || Utilities.IsWordChar(slice[0]))
                                  {
-                                       goto CharLoopBacktrack;
+                                       goto LoopIterationNoMatch;
                                  }
                                  
                                  pos = positivelookahead_starting_pos;
                              }
                              
                              base.Capture(3, capture_starting_pos1, pos);
-                               
-                               Utilities.StackPush(ref base.runstack!, ref stackpos, capture_starting_pos1);
-                               goto CaptureSkipBacktrack;
-                               
-                               CaptureBacktrack:
-                               capture_starting_pos1 = base.runstack![--stackpos];
-                               goto CharLoopBacktrack;
-                               
-                               CaptureSkipBacktrack:;
-                           //}
+                           }
                          
                          // 2nd capture group.
                          //{
                                      
                                      // Match a whitespace character greedily any number of times.
                                      //{
-                                           charloop_starting_pos1 = pos;
+                                           charloop_starting_pos = pos;
                                          
                                          int iteration6 = 0;
                                          while ((uint)iteration6 < (uint)slice.Length && char.IsWhiteSpace(slice[iteration6]))
                                          slice = slice.Slice(iteration6);
                                          pos += iteration6;
                                          
-                                           charloop_ending_pos1 = pos;
-                                           goto CharLoopEnd1;
+                                           charloop_ending_pos = pos;
+                                           goto CharLoopEnd;
                                          
-                                           CharLoopBacktrack1:
+                                           CharLoopBacktrack:
                                          UncaptureUntil(base.runstack![--stackpos]);
-                                           Utilities.StackPop(base.runstack!, ref stackpos, out charloop_ending_pos1, out charloop_starting_pos1);
+                                           Utilities.StackPop(base.runstack!, ref stackpos, out charloop_ending_pos, out charloop_starting_pos);
                                          
                                          if (Utilities.s_hasTimeout)
                                          {
                                              base.CheckTimeout();
                                          }
                                          
-                                           if (charloop_starting_pos1 >= charloop_ending_pos1)
+                                           if (charloop_starting_pos >= charloop_ending_pos)
                                          {
                                              goto AlternationBranch1;
                                          }
-                                           pos = --charloop_ending_pos1;
+                                           pos = --charloop_ending_pos;
                                          slice = inputSpan.Slice(pos);
                                          
-                                           CharLoopEnd1:
-                                           Utilities.StackPush(ref base.runstack!, ref stackpos, charloop_starting_pos1, charloop_ending_pos1, base.Crawlpos());
+                                           CharLoopEnd:
+                                           Utilities.StackPush(ref base.runstack!, ref stackpos, charloop_starting_pos, charloop_ending_pos, base.Crawlpos());
                                      //}
                                      
                                      // "attrval" capture group.
                                          
                                          // Match a character in the set [^%>\s] greedily any number of times.
                                          //{
-                                               charloop_starting_pos2 = pos;
+                                               charloop_starting_pos1 = pos;
                                              
                                              int iteration7 = 0;
                                              while ((uint)iteration7 < (uint)slice.Length && ((ch = slice[iteration7]) < 128 ? ("쇿\uffff\uffde뿿\uffff\uffff\uffff\uffff"[ch >> 4] & (1 << (ch & 0xF))) != 0 : RegexRunner.CharInClass((char)ch, "\u0001\u0004\u0001%&>?d")))
                                              slice = slice.Slice(iteration7);
                                              pos += iteration7;
                                              
-                                               charloop_ending_pos2 = pos;
-                                               goto CharLoopEnd2;
+                                               charloop_ending_pos1 = pos;
+                                               goto CharLoopEnd1;
                                              
-                                               CharLoopBacktrack2:
+                                               CharLoopBacktrack1:
                                              UncaptureUntil(base.runstack![--stackpos]);
-                                               Utilities.StackPop(base.runstack!, ref stackpos, out charloop_ending_pos2, out charloop_starting_pos2);
+                                               Utilities.StackPop(base.runstack!, ref stackpos, out charloop_ending_pos1, out charloop_starting_pos1);
                                              
                                              if (Utilities.s_hasTimeout)
                                              {
                                                  base.CheckTimeout();
                                              }
                                              
-                                               if (charloop_starting_pos2 >= charloop_ending_pos2)
+                                               if (charloop_starting_pos1 >= charloop_ending_pos1)
                                              {
-                                                   goto CharLoopBacktrack1;
+                                                   goto CharLoopBacktrack;
                                              }
-                                               pos = --charloop_ending_pos2;
+                                               pos = --charloop_ending_pos1;
                                              slice = inputSpan.Slice(pos);
                                              
-                                               CharLoopEnd2:
-                                               Utilities.StackPush(ref base.runstack!, ref stackpos, charloop_starting_pos2, charloop_ending_pos2, base.Crawlpos());
+                                               CharLoopEnd1:
+                                               Utilities.StackPush(ref base.runstack!, ref stackpos, charloop_starting_pos1, charloop_ending_pos1, base.Crawlpos());
                                          //}
                                          
                                          base.Capture(5, capture_starting_pos6, pos);
                                          
                                          Utilities.StackPush(ref base.runstack!, ref stackpos, capture_starting_pos6);
-                                           goto CaptureSkipBacktrack1;
+                                           goto CaptureSkipBacktrack;
                                          
-                                           CaptureBacktrack1:
+                                           CaptureBacktrack:
                                          capture_starting_pos6 = base.runstack![--stackpos];
-                                           goto CharLoopBacktrack2;
+                                           goto CharLoopBacktrack1;
                                          
-                                           CaptureSkipBacktrack1:;
+                                           CaptureSkipBacktrack:;
                                      //}
                                      
                                      Utilities.StackPush(ref base.runstack!, ref stackpos, 1, alternation_starting_pos, alternation_starting_capturepos);
                                              slice = inputSpan.Slice(pos);
                                              if (slice.IsEmpty || !char.IsWhiteSpace(slice[0]))
                                              {
-                                                   goto CaptureBacktrack;
+                                                   goto LoopIterationNoMatch;
                                              }
                                              pos++;
                                              slice = inputSpan.Slice(pos);
                                          base.Capture(5, capture_starting_pos8, pos);
                                          
                                          Utilities.StackPush(ref base.runstack!, ref stackpos, capture_starting_pos8);
-                                           goto CaptureSkipBacktrack2;
+                                           goto CaptureSkipBacktrack1;
                                          
-                                           CaptureBacktrack2:
+                                           CaptureBacktrack1:
                                          capture_starting_pos8 = base.runstack![--stackpos];
                                          goto CharLazyBacktrack;
                                          
-                                           CaptureSkipBacktrack2:;
+                                           CaptureSkipBacktrack1:;
                                      //}
                                      
                                      Utilities.StackPush(ref base.runstack!, ref stackpos, 2, alternation_starting_pos, alternation_starting_capturepos);
                                      case 0:
                                          goto AlternationBranch;
                                      case 1:
-                                           goto CaptureBacktrack1;
+                                           goto CaptureBacktrack;
                                      case 2:
-                                           goto CaptureBacktrack2;
+                                           goto CaptureBacktrack1;
                                  }
                                  
                                  AlternationMatch:;
                              base.Capture(2, capture_starting_pos2, pos);
                              
                              Utilities.StackPush(ref base.runstack!, ref stackpos, capture_starting_pos2);
-                               goto CaptureSkipBacktrack3;
+                               goto CaptureSkipBacktrack2;
                              
-                               CaptureBacktrack3:
+                               CaptureBacktrack2:
                              capture_starting_pos2 = base.runstack![--stackpos];
                              goto AlternationBacktrack;
                              
-                               CaptureSkipBacktrack3:;
+                               CaptureSkipBacktrack2:;
                          //}
                          
                          base.Capture(1, capture_starting_pos, pos);
                          
                          Utilities.StackPush(ref base.runstack!, ref stackpos, capture_starting_pos);
-                           goto CaptureSkipBacktrack4;
+                           goto CaptureSkipBacktrack3;
                          
-                           CaptureBacktrack4:
+                           CaptureBacktrack3:
                          capture_starting_pos = base.runstack![--stackpos];
-                           goto CaptureBacktrack3;
+                           goto CaptureBacktrack2;
                          
-                           CaptureSkipBacktrack4:;
+                           CaptureSkipBacktrack3:;
                      //}
                      
                      
                          UncaptureUntil(0);
                          return false; // The input didn't match.
                      }
-                       goto CaptureBacktrack4;
+                       goto CaptureBacktrack3;
                      LoopEnd:;
                  //}
"(\\B[A-Z]+?(?=[A-Z][^A-Z])|\\B[A-Z]+?(?=[^A-Z]))" (514 uses)
[GeneratedRegex("(\\B[A-Z]+?(?=[A-Z][^A-Z])|\\B[A-Z]+?(?=[^A-Z]))")]
  ///                 ○ Match a character in the set [A-Z].<br/>
  ///                 ○ Match a character in the set [^A-Z].<br/>
  ///         ○ Match a sequence of expressions.<br/>
-   ///             ○ Match a character in the set [A-Z] lazily at least once.<br/>
+   ///             ○ Match a character in the set [A-Z] atomically at least once.<br/>
  ///             ○ Zero-width positive lookahead.<br/>
  ///                 ○ Match a character in the set [^A-Z].<br/>
  /// </code>
                  int matchStart = pos;
                  int capture_starting_pos = 0;
                  int lazyloop_capturepos = 0;
-                   int lazyloop_capturepos1 = 0;
                  int lazyloop_pos = 0;
-                   int lazyloop_pos1 = 0;
                  int stackpos = 0;
                  ReadOnlySpan<char> slice = inputSpan.Slice(pos);
                  
                              
                              // Branch 1
                              {
-                                   // Match a character in the set [A-Z] lazily at least once.
-                                   //{
-                                       if (slice.IsEmpty || !char.IsAsciiLetterUpper(slice[0]))
+                                   // Match a character in the set [A-Z] atomically at least once.
+                                   {
+                                       int iteration = slice.IndexOfAnyExceptInRange('A', 'Z');
+                                       if (iteration < 0)
+                                       {
+                                           iteration = slice.Length;
+                                       }
+                                       
+                                       if (iteration == 0)
                                      {
                                          UncaptureUntil(0);
                                          return false; // The input didn't match.
                                      }
                                      
-                                       pos++;
-                                       slice = inputSpan.Slice(pos);
-                                       lazyloop_pos1 = pos;
-                                       goto LazyLoopEnd1;
-                                       
-                                       LazyLoopBacktrack1:
-                                       UncaptureUntil(lazyloop_capturepos1);
-                                       if (Utilities.s_hasTimeout)
-                                       {
-                                           base.CheckTimeout();
-                                       }
-                                       
-                                       pos = lazyloop_pos1;
-                                       slice = inputSpan.Slice(pos);
-                                       if (slice.IsEmpty || !char.IsAsciiLetterUpper(slice[0]))
-                                       {
-                                           UncaptureUntil(0);
-                                           return false; // The input didn't match.
-                                       }
-                                       pos++;
-                                       slice = inputSpan.Slice(pos);
-                                       lazyloop_pos1 = pos;
-                                       
-                                       LazyLoopEnd1:
-                                       lazyloop_capturepos1 = base.Crawlpos();
-                                   //}
+                                       slice = slice.Slice(iteration);
+                                       pos += iteration;
+                                   }
                                  
                                  // Zero-width positive lookahead.
                                  {
                                      // Match a character in the set [^A-Z].
                                      if (slice.IsEmpty || char.IsAsciiLetterUpper(slice[0]))
                                      {
-                                           goto LazyLoopBacktrack1;
+                                           UncaptureUntil(0);
+                                           return false; // The input didn't match.
                                      }
                                      
                                      pos = positivelookahead_starting_pos1;
"Video:\\s*([^,]*),\\s*([^,]*,?[^,]*?),?\\s*( ..." (3 uses)
[GeneratedRegex("Video:\\s*([^,]*),\\s*([^,]*,?[^,]*?),?\\s*(?=[0-9]*x[0-9]*)([0-9]*x[0-9]*)")]
  ///     ○ Match a character other than ',' greedily any number of times.<br/>
  ///     ○ Match ',' greedily, optionally.<br/>
  ///     ○ Match a character other than ',' lazily any number of times.<br/>
-   /// ○ Match ',' greedily, optionally.<br/>
-   /// ○ Match a whitespace character greedily any number of times.<br/>
+   /// ○ Match ',' atomically, optionally.<br/>
+   /// ○ Match a whitespace character atomically any number of times.<br/>
  /// ○ Zero-width positive lookahead.<br/>
  ///     ○ Match a character in the set [0-9] atomically any number of times.<br/>
  ///     ○ Match 'x'.<br/>
                  int charloop_capture_pos1 = 0;
                  int charloop_capture_pos2 = 0;
                  int charloop_capture_pos3 = 0;
-                   int charloop_capture_pos4 = 0;
-                   int charloop_capture_pos5 = 0;
                  int charloop_starting_pos = 0, charloop_ending_pos = 0;
                  int charloop_starting_pos1 = 0, charloop_ending_pos1 = 0;
                  int charloop_starting_pos2 = 0, charloop_ending_pos2 = 0;
                  int charloop_starting_pos3 = 0, charloop_ending_pos3 = 0;
-                   int charloop_starting_pos4 = 0, charloop_ending_pos4 = 0;
-                   int charloop_starting_pos5 = 0, charloop_ending_pos5 = 0;
                  int lazyloop_capturepos = 0;
                  int lazyloop_pos = 0;
                  ReadOnlySpan<char> slice = inputSpan.Slice(pos);
                      CaptureSkipBacktrack:;
                  //}
                  
-                   // Match ',' greedily, optionally.
-                   //{
-                       charloop_starting_pos4 = pos;
-                       
+                   // Match ',' atomically, optionally.
+                   {
                      if (!slice.IsEmpty && slice[0] == ',')
                      {
                          slice = slice.Slice(1);
                          pos++;
                      }
-                       
-                       charloop_ending_pos4 = pos;
-                       goto CharLoopEnd4;
-                       
-                       CharLoopBacktrack4:
-                       UncaptureUntil(charloop_capture_pos4);
-                       
-                       if (Utilities.s_hasTimeout)
-                       {
-                           base.CheckTimeout();
-                       }
-                       
-                       if (charloop_starting_pos4 >= charloop_ending_pos4)
-                       {
-                           goto CaptureBacktrack;
-                       }
-                       pos = --charloop_ending_pos4;
-                       slice = inputSpan.Slice(pos);
-                       
-                       CharLoopEnd4:
-                       charloop_capture_pos4 = base.Crawlpos();
-                   //}
+                   }
                  
-                   // Match a whitespace character greedily any number of times.
-                   //{
-                       charloop_starting_pos5 = pos;
-                       
+                   // Match a whitespace character atomically any number of times.
+                   {
                      int iteration4 = 0;
                      while ((uint)iteration4 < (uint)slice.Length && char.IsWhiteSpace(slice[iteration4]))
                      {
                      
                      slice = slice.Slice(iteration4);
                      pos += iteration4;
-                       
-                       charloop_ending_pos5 = pos;
-                       goto CharLoopEnd5;
-                       
-                       CharLoopBacktrack5:
-                       UncaptureUntil(charloop_capture_pos5);
-                       
-                       if (Utilities.s_hasTimeout)
-                       {
-                           base.CheckTimeout();
-                       }
-                       
-                       if (charloop_starting_pos5 >= charloop_ending_pos5)
-                       {
-                           goto CharLoopBacktrack4;
-                       }
-                       pos = --charloop_ending_pos5;
-                       slice = inputSpan.Slice(pos);
-                       
-                       CharLoopEnd5:
-                       charloop_capture_pos5 = base.Crawlpos();
-                   //}
+                   }
                  
                  // Zero-width positive lookahead.
                  {
                      // Match 'x'.
                      if (slice.IsEmpty || slice[0] != 'x')
                      {
-                           goto CharLoopBacktrack5;
+                           goto CaptureBacktrack;
                      }
                      
                      // Match a character in the set [0-9] atomically any number of times.
                      // Match 'x'.
                      if (slice.IsEmpty || slice[0] != 'x')
                      {
-                           goto CharLoopBacktrack5;
+                           goto CaptureBacktrack;
                      }
                      
                      // Match a character in the set [0-9] atomically any number of times.

For more diff examples, see https://gist.github.com/MihuBot/d7d06d3c45c0311af12f2d5a2fae1e6e

JIT assembly changes
Total bytes of base: 53914034
Total bytes of diff: 53911150
Total bytes of delta: -2884 (-0.01 % of base)
Total relative delta: -0.61
    diff is an improvement.
    relative diff is an improvement.

For a list of JIT diff improvements, see Improvements.md

Sample source code for further analysis
const string JsonPath = "RegexResults-1329.json";
if (!File.Exists(JsonPath))
{
    await using var archiveStream = await new HttpClient().GetStreamAsync("https://mihubot.xyz/r/E26rbsGA");
    using var archive = new ZipArchive(archiveStream, ZipArchiveMode.Read);
    archive.Entries.First(e => e.Name == "Results.json").ExtractToFile(JsonPath);
}

using FileStream jsonFileStream = File.OpenRead(JsonPath);
RegexEntry[] entries = JsonSerializer.Deserialize<RegexEntry[]>(jsonFileStream, new JsonSerializerOptions { IncludeFields = true })!;
Console.WriteLine($"Working with {entries.Length} patterns");



record KnownPattern(string Pattern, RegexOptions Options, int Count);

sealed class RegexEntry
{
    public required KnownPattern Regex { get; set; }
    public required string MainSource { get; set; }
    public required string PrSource { get; set; }
    public string? FullDiff { get; set; }
    public string? ShortDiff { get; set; }
    public (string Name, string Values)[]? SearchValuesOfChar { get; set; }
    public (string[] Values, StringComparison ComparisonType)[]? SearchValuesOfString { get; set; }
}

@stephentoub stephentoub enabled auto-merge (squash) July 31, 2025 02:05
@stephentoub stephentoub merged commit 2b52fb2 into dotnet:main Jul 31, 2025
85 of 87 checks passed
@github-actions github-actions bot locked and limited conversation to collaborators Aug 31, 2025
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants