-
Notifications
You must be signed in to change notification settings - Fork 5.2k
Tweak positive lookahead handling in CanBeMadeAtomic #118221
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
I was overthinking things before, and the handling for positive lookaheads can be simplified and expanded. The logic was saying that a loop followed by an positive lookahead could be made atomic if it was disjoint not only from the lookahead's contents but also from what followed the lookahead. However, the latter isn't actually necessary. If the loop is disjoint with the lookahead, then it's also disjoint with the intersection of the lookahead and what follows / overlaps with the lookahead.
@MihuBot regexdiff |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull Request Overview
This PR simplifies and improves the logic for positive lookahead handling in the CanBeMadeAtomic
method within the regex reduction optimization system. The change eliminates unnecessary complexity by removing the requirement that a loop must be disjoint from both the lookahead contents AND what follows the lookahead.
Key changes:
- Simplified positive lookahead handling logic in
CanBeMadeAtomic
- Added support for skipping through positive lookaheads during atomic optimization analysis
- Expanded test coverage with additional regex reduction test cases
Reviewed Changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 1 comment.
File | Description |
---|---|
RegexNode.cs | Simplified positive lookahead handling by removing redundant disjointness check and adding logic to skip through positive lookaheads |
RegexReductionTests.cs | Added new test cases covering positive lookahead scenarios and corrected an existing test expectation |
src/libraries/System.Text.RegularExpressions/tests/UnitTests/RegexReductionTests.cs
Outdated
Show resolved
Hide resolved
Tagging subscribers to this area: @dotnet/area-system-text-regularexpressions |
7 out of 18857 patterns have generated source code changes. Examples of GeneratedRegex source diffs"\\G\\{%(\\s*(?<attrname>\\w+(?=\\W))(\\s*(?< ..." (1751 uses)[GeneratedRegex("\\G\\{%(\\s*(?<attrname>\\w+(?=\\W))(\\s*(?<equal>=)\\s*'(?<attrval>[^']*)'|\\s*(?<equal>=)\\s*(?<attrval>[^\\s%>]*)|(?<equal>)(?<attrval>\\s*?)))*\\s*?%\\}")] /// ○ 1st capture group.<br/>
/// ○ Match a whitespace character atomically any number of times.<br/>
/// ○ "attrname" capture group.<br/>
- /// ○ Match a word character greedily at least once.<br/>
+ /// ○ Match a word character atomically at least once.<br/>
/// ○ Zero-width positive lookahead.<br/>
/// ○ Match any character other than a word character.<br/>
/// ○ 2nd capture group.<br/>
char ch;
int charloop_starting_pos = 0, charloop_ending_pos = 0;
int charloop_starting_pos1 = 0, charloop_ending_pos1 = 0;
- int charloop_starting_pos2 = 0, charloop_ending_pos2 = 0;
int lazyloop_capturepos = 0;
int lazyloop_pos = 0;
int loop_iteration = 0;
}
// "attrname" capture group.
- //{
+ {
int capture_starting_pos1 = pos;
- // Match a word character greedily at least once.
- //{
- charloop_starting_pos = pos;
-
+ // Match a word character atomically at least once.
+ {
int iteration1 = 0;
while ((uint)iteration1 < (uint)slice.Length && Utilities.IsWordChar(slice[iteration1]))
{
slice = slice.Slice(iteration1);
pos += iteration1;
-
- charloop_ending_pos = pos;
- charloop_starting_pos++;
- goto CharLoopEnd;
-
- CharLoopBacktrack:
- UncaptureUntil(base.runstack![--stackpos]);
- Utilities.StackPop(base.runstack!, ref stackpos, out charloop_ending_pos, out charloop_starting_pos);
-
- if (Utilities.s_hasTimeout)
- {
- base.CheckTimeout();
- }
-
- if (charloop_starting_pos >= charloop_ending_pos)
- {
- goto LoopIterationNoMatch;
- }
- pos = --charloop_ending_pos;
- slice = inputSpan.Slice(pos);
-
- CharLoopEnd:
- Utilities.StackPush(ref base.runstack!, ref stackpos, charloop_starting_pos, charloop_ending_pos, base.Crawlpos());
- //}
+ }
// Zero-width positive lookahead.
{
// Match any character other than a word character.
if (slice.IsEmpty || Utilities.IsWordChar(slice[0]))
{
- goto CharLoopBacktrack;
+ goto LoopIterationNoMatch;
}
pos = positivelookahead_starting_pos;
}
base.Capture(3, capture_starting_pos1, pos);
-
- Utilities.StackPush(ref base.runstack!, ref stackpos, capture_starting_pos1);
- goto CaptureSkipBacktrack;
-
- CaptureBacktrack:
- capture_starting_pos1 = base.runstack![--stackpos];
- goto CharLoopBacktrack;
-
- CaptureSkipBacktrack:;
- //}
+ }
// 2nd capture group.
//{
// Match a whitespace character greedily any number of times.
//{
- charloop_starting_pos1 = pos;
+ charloop_starting_pos = pos;
int iteration6 = 0;
while ((uint)iteration6 < (uint)slice.Length && char.IsWhiteSpace(slice[iteration6]))
slice = slice.Slice(iteration6);
pos += iteration6;
- charloop_ending_pos1 = pos;
- goto CharLoopEnd1;
+ charloop_ending_pos = pos;
+ goto CharLoopEnd;
- CharLoopBacktrack1:
+ CharLoopBacktrack:
UncaptureUntil(base.runstack![--stackpos]);
- Utilities.StackPop(base.runstack!, ref stackpos, out charloop_ending_pos1, out charloop_starting_pos1);
+ Utilities.StackPop(base.runstack!, ref stackpos, out charloop_ending_pos, out charloop_starting_pos);
if (Utilities.s_hasTimeout)
{
base.CheckTimeout();
}
- if (charloop_starting_pos1 >= charloop_ending_pos1)
+ if (charloop_starting_pos >= charloop_ending_pos)
{
goto AlternationBranch1;
}
- pos = --charloop_ending_pos1;
+ pos = --charloop_ending_pos;
slice = inputSpan.Slice(pos);
- CharLoopEnd1:
- Utilities.StackPush(ref base.runstack!, ref stackpos, charloop_starting_pos1, charloop_ending_pos1, base.Crawlpos());
+ CharLoopEnd:
+ Utilities.StackPush(ref base.runstack!, ref stackpos, charloop_starting_pos, charloop_ending_pos, base.Crawlpos());
//}
// "attrval" capture group.
// Match a character in the set [^%>\s] greedily any number of times.
//{
- charloop_starting_pos2 = pos;
+ charloop_starting_pos1 = pos;
int iteration7 = 0;
while ((uint)iteration7 < (uint)slice.Length && ((ch = slice[iteration7]) < 128 ? ("쇿\uffff\uffde뿿\uffff\uffff\uffff\uffff"[ch >> 4] & (1 << (ch & 0xF))) != 0 : RegexRunner.CharInClass((char)ch, "\u0001\u0004\u0001%&>?d")))
slice = slice.Slice(iteration7);
pos += iteration7;
- charloop_ending_pos2 = pos;
- goto CharLoopEnd2;
+ charloop_ending_pos1 = pos;
+ goto CharLoopEnd1;
- CharLoopBacktrack2:
+ CharLoopBacktrack1:
UncaptureUntil(base.runstack![--stackpos]);
- Utilities.StackPop(base.runstack!, ref stackpos, out charloop_ending_pos2, out charloop_starting_pos2);
+ Utilities.StackPop(base.runstack!, ref stackpos, out charloop_ending_pos1, out charloop_starting_pos1);
if (Utilities.s_hasTimeout)
{
base.CheckTimeout();
}
- if (charloop_starting_pos2 >= charloop_ending_pos2)
+ if (charloop_starting_pos1 >= charloop_ending_pos1)
{
- goto CharLoopBacktrack1;
+ goto CharLoopBacktrack;
}
- pos = --charloop_ending_pos2;
+ pos = --charloop_ending_pos1;
slice = inputSpan.Slice(pos);
- CharLoopEnd2:
- Utilities.StackPush(ref base.runstack!, ref stackpos, charloop_starting_pos2, charloop_ending_pos2, base.Crawlpos());
+ CharLoopEnd1:
+ Utilities.StackPush(ref base.runstack!, ref stackpos, charloop_starting_pos1, charloop_ending_pos1, base.Crawlpos());
//}
base.Capture(5, capture_starting_pos6, pos);
Utilities.StackPush(ref base.runstack!, ref stackpos, capture_starting_pos6);
- goto CaptureSkipBacktrack1;
+ goto CaptureSkipBacktrack;
- CaptureBacktrack1:
+ CaptureBacktrack:
capture_starting_pos6 = base.runstack![--stackpos];
- goto CharLoopBacktrack2;
+ goto CharLoopBacktrack1;
- CaptureSkipBacktrack1:;
+ CaptureSkipBacktrack:;
//}
Utilities.StackPush(ref base.runstack!, ref stackpos, 1, alternation_starting_pos, alternation_starting_capturepos);
slice = inputSpan.Slice(pos);
if (slice.IsEmpty || !char.IsWhiteSpace(slice[0]))
{
- goto CaptureBacktrack;
+ goto LoopIterationNoMatch;
}
pos++;
slice = inputSpan.Slice(pos);
base.Capture(5, capture_starting_pos8, pos);
Utilities.StackPush(ref base.runstack!, ref stackpos, capture_starting_pos8);
- goto CaptureSkipBacktrack2;
+ goto CaptureSkipBacktrack1;
- CaptureBacktrack2:
+ CaptureBacktrack1:
capture_starting_pos8 = base.runstack![--stackpos];
goto CharLazyBacktrack;
- CaptureSkipBacktrack2:;
+ CaptureSkipBacktrack1:;
//}
Utilities.StackPush(ref base.runstack!, ref stackpos, 2, alternation_starting_pos, alternation_starting_capturepos);
case 0:
goto AlternationBranch;
case 1:
- goto CaptureBacktrack1;
+ goto CaptureBacktrack;
case 2:
- goto CaptureBacktrack2;
+ goto CaptureBacktrack1;
}
AlternationMatch:;
base.Capture(2, capture_starting_pos2, pos);
Utilities.StackPush(ref base.runstack!, ref stackpos, capture_starting_pos2);
- goto CaptureSkipBacktrack3;
+ goto CaptureSkipBacktrack2;
- CaptureBacktrack3:
+ CaptureBacktrack2:
capture_starting_pos2 = base.runstack![--stackpos];
goto AlternationBacktrack;
- CaptureSkipBacktrack3:;
+ CaptureSkipBacktrack2:;
//}
base.Capture(1, capture_starting_pos, pos);
Utilities.StackPush(ref base.runstack!, ref stackpos, capture_starting_pos);
- goto CaptureSkipBacktrack4;
+ goto CaptureSkipBacktrack3;
- CaptureBacktrack4:
+ CaptureBacktrack3:
capture_starting_pos = base.runstack![--stackpos];
- goto CaptureBacktrack3;
+ goto CaptureBacktrack2;
- CaptureSkipBacktrack4:;
+ CaptureSkipBacktrack3:;
//}
UncaptureUntil(0);
return false; // The input didn't match.
}
- goto CaptureBacktrack4;
+ goto CaptureBacktrack3;
LoopEnd:;
//} "(\\B[A-Z]+?(?=[A-Z][^A-Z])|\\B[A-Z]+?(?=[^A-Z]))" (514 uses)[GeneratedRegex("(\\B[A-Z]+?(?=[A-Z][^A-Z])|\\B[A-Z]+?(?=[^A-Z]))")] /// ○ Match a character in the set [A-Z].<br/>
/// ○ Match a character in the set [^A-Z].<br/>
/// ○ Match a sequence of expressions.<br/>
- /// ○ Match a character in the set [A-Z] lazily at least once.<br/>
+ /// ○ Match a character in the set [A-Z] atomically at least once.<br/>
/// ○ Zero-width positive lookahead.<br/>
/// ○ Match a character in the set [^A-Z].<br/>
/// </code>
int matchStart = pos;
int capture_starting_pos = 0;
int lazyloop_capturepos = 0;
- int lazyloop_capturepos1 = 0;
int lazyloop_pos = 0;
- int lazyloop_pos1 = 0;
int stackpos = 0;
ReadOnlySpan<char> slice = inputSpan.Slice(pos);
// Branch 1
{
- // Match a character in the set [A-Z] lazily at least once.
- //{
- if (slice.IsEmpty || !char.IsAsciiLetterUpper(slice[0]))
+ // Match a character in the set [A-Z] atomically at least once.
+ {
+ int iteration = slice.IndexOfAnyExceptInRange('A', 'Z');
+ if (iteration < 0)
+ {
+ iteration = slice.Length;
+ }
+
+ if (iteration == 0)
{
UncaptureUntil(0);
return false; // The input didn't match.
}
- pos++;
- slice = inputSpan.Slice(pos);
- lazyloop_pos1 = pos;
- goto LazyLoopEnd1;
-
- LazyLoopBacktrack1:
- UncaptureUntil(lazyloop_capturepos1);
- if (Utilities.s_hasTimeout)
- {
- base.CheckTimeout();
- }
-
- pos = lazyloop_pos1;
- slice = inputSpan.Slice(pos);
- if (slice.IsEmpty || !char.IsAsciiLetterUpper(slice[0]))
- {
- UncaptureUntil(0);
- return false; // The input didn't match.
- }
- pos++;
- slice = inputSpan.Slice(pos);
- lazyloop_pos1 = pos;
-
- LazyLoopEnd1:
- lazyloop_capturepos1 = base.Crawlpos();
- //}
+ slice = slice.Slice(iteration);
+ pos += iteration;
+ }
// Zero-width positive lookahead.
{
// Match a character in the set [^A-Z].
if (slice.IsEmpty || char.IsAsciiLetterUpper(slice[0]))
{
- goto LazyLoopBacktrack1;
+ UncaptureUntil(0);
+ return false; // The input didn't match.
}
pos = positivelookahead_starting_pos1; "Video:\\s*([^,]*),\\s*([^,]*,?[^,]*?),?\\s*( ..." (3 uses)[GeneratedRegex("Video:\\s*([^,]*),\\s*([^,]*,?[^,]*?),?\\s*(?=[0-9]*x[0-9]*)([0-9]*x[0-9]*)")] /// ○ Match a character other than ',' greedily any number of times.<br/>
/// ○ Match ',' greedily, optionally.<br/>
/// ○ Match a character other than ',' lazily any number of times.<br/>
- /// ○ Match ',' greedily, optionally.<br/>
- /// ○ Match a whitespace character greedily any number of times.<br/>
+ /// ○ Match ',' atomically, optionally.<br/>
+ /// ○ Match a whitespace character atomically any number of times.<br/>
/// ○ Zero-width positive lookahead.<br/>
/// ○ Match a character in the set [0-9] atomically any number of times.<br/>
/// ○ Match 'x'.<br/>
int charloop_capture_pos1 = 0;
int charloop_capture_pos2 = 0;
int charloop_capture_pos3 = 0;
- int charloop_capture_pos4 = 0;
- int charloop_capture_pos5 = 0;
int charloop_starting_pos = 0, charloop_ending_pos = 0;
int charloop_starting_pos1 = 0, charloop_ending_pos1 = 0;
int charloop_starting_pos2 = 0, charloop_ending_pos2 = 0;
int charloop_starting_pos3 = 0, charloop_ending_pos3 = 0;
- int charloop_starting_pos4 = 0, charloop_ending_pos4 = 0;
- int charloop_starting_pos5 = 0, charloop_ending_pos5 = 0;
int lazyloop_capturepos = 0;
int lazyloop_pos = 0;
ReadOnlySpan<char> slice = inputSpan.Slice(pos);
CaptureSkipBacktrack:;
//}
- // Match ',' greedily, optionally.
- //{
- charloop_starting_pos4 = pos;
-
+ // Match ',' atomically, optionally.
+ {
if (!slice.IsEmpty && slice[0] == ',')
{
slice = slice.Slice(1);
pos++;
}
-
- charloop_ending_pos4 = pos;
- goto CharLoopEnd4;
-
- CharLoopBacktrack4:
- UncaptureUntil(charloop_capture_pos4);
-
- if (Utilities.s_hasTimeout)
- {
- base.CheckTimeout();
- }
-
- if (charloop_starting_pos4 >= charloop_ending_pos4)
- {
- goto CaptureBacktrack;
- }
- pos = --charloop_ending_pos4;
- slice = inputSpan.Slice(pos);
-
- CharLoopEnd4:
- charloop_capture_pos4 = base.Crawlpos();
- //}
+ }
- // Match a whitespace character greedily any number of times.
- //{
- charloop_starting_pos5 = pos;
-
+ // Match a whitespace character atomically any number of times.
+ {
int iteration4 = 0;
while ((uint)iteration4 < (uint)slice.Length && char.IsWhiteSpace(slice[iteration4]))
{
slice = slice.Slice(iteration4);
pos += iteration4;
-
- charloop_ending_pos5 = pos;
- goto CharLoopEnd5;
-
- CharLoopBacktrack5:
- UncaptureUntil(charloop_capture_pos5);
-
- if (Utilities.s_hasTimeout)
- {
- base.CheckTimeout();
- }
-
- if (charloop_starting_pos5 >= charloop_ending_pos5)
- {
- goto CharLoopBacktrack4;
- }
- pos = --charloop_ending_pos5;
- slice = inputSpan.Slice(pos);
-
- CharLoopEnd5:
- charloop_capture_pos5 = base.Crawlpos();
- //}
+ }
// Zero-width positive lookahead.
{
// Match 'x'.
if (slice.IsEmpty || slice[0] != 'x')
{
- goto CharLoopBacktrack5;
+ goto CaptureBacktrack;
}
// Match a character in the set [0-9] atomically any number of times.
// Match 'x'.
if (slice.IsEmpty || slice[0] != 'x')
{
- goto CharLoopBacktrack5;
+ goto CaptureBacktrack;
}
// Match a character in the set [0-9] atomically any number of times. For more diff examples, see https://gist.github.com/MihuBot/d7d06d3c45c0311af12f2d5a2fae1e6e JIT assembly changes
For a list of JIT diff improvements, see Improvements.md Sample source code for further analysisconst string JsonPath = "RegexResults-1329.json";
if (!File.Exists(JsonPath))
{
await using var archiveStream = await new HttpClient().GetStreamAsync("https://mihubot.xyz/r/E26rbsGA");
using var archive = new ZipArchive(archiveStream, ZipArchiveMode.Read);
archive.Entries.First(e => e.Name == "Results.json").ExtractToFile(JsonPath);
}
using FileStream jsonFileStream = File.OpenRead(JsonPath);
RegexEntry[] entries = JsonSerializer.Deserialize<RegexEntry[]>(jsonFileStream, new JsonSerializerOptions { IncludeFields = true })!;
Console.WriteLine($"Working with {entries.Length} patterns");
record KnownPattern(string Pattern, RegexOptions Options, int Count);
sealed class RegexEntry
{
public required KnownPattern Regex { get; set; }
public required string MainSource { get; set; }
public required string PrSource { get; set; }
public string? FullDiff { get; set; }
public string? ShortDiff { get; set; }
public (string Name, string Values)[]? SearchValuesOfChar { get; set; }
public (string[] Values, StringComparison ComparisonType)[]? SearchValuesOfString { get; set; }
} |
I was overthinking things before, and the handling for positive lookaheads can be simplified and expanded. The logic was saying that a loop followed by an positive lookahead could be made atomic if it was disjoint not only from the lookahead's contents but also from what followed the lookahead. However, the latter isn't actually necessary. If the loop is disjoint with the lookahead, then it's also disjoint with the intersection of the lookahead and what follows / overlaps with the lookahead.