Coalesce adjacent equivalent anchors #118083
Sometimes you see patterns where folks have put the same anchor multiple times in a row, e.g. `\b\b`. The subsequent anchors are nops and can just be removed.
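A quick way to see why the repeated anchor is a no-op: anchors are zero-width assertions, so asserting the same condition twice at the same position cannot change the result. The sketch below compares the match spans of `p\b\b` and `p\b` on a made-up input; the class name, helper, and input string are illustrative assumptions, not part of the PR.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

public static class AnchorEquivalenceDemo
{
    // Returns the (index, length) of every match of the pattern in the input.
    public static (int Index, int Length)[] MatchSpans(string pattern, string input) =>
        Regex.Matches(input, pattern).Select(m => (m.Index, m.Length)).ToArray();

    public static void Main()
    {
        // Hypothetical input chosen to include a "p" with and without a following word boundary.
        string input = "12 pm, 3 p.m. and 4p";

        var doubled = MatchSpans(@"p\b\b", input); // anchor written twice
        var single = MatchSpans(@"p\b", input);    // anchor written once

        // Both patterns assert the same boundary at the same position,
        // so the two span lists are identical.
        Console.WriteLine(doubled.SequenceEqual(single));
    }
}
```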
@MihuBot regexdiff
Pull Request Overview
This PR implements an optimization for regex patterns by coalescing adjacent equivalent anchors (e.g., `\b\b` becomes `\b`). This removes redundant anchor nodes that don't affect matching behavior, so the constructed regex has fewer checks to perform.
Key changes:
- Adds logic to detect and remove consecutive identical anchor patterns during regex node reduction
- Adds comprehensive test coverage for various anchor types and combinations
Reviewed Changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 3 comments.
File | Description |
---|---|
RegexNode.cs | Implements anchor coalescing logic in the regex reduction algorithm |
RegexReductionTests.cs | Adds test cases for anchor coalescing scenarios and edge cases |
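The reduction described above can be sketched on a toy model. This is not the code from RegexNode.cs: `NodeKind`, `Node`, and `AnchorCoalescer` are hypothetical stand-ins for the real, much richer node types, and the sketch only shows the core idea of dropping an anchor when the previous node in the same concatenation is an anchor of the same kind.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical, simplified node kinds; the real regex node types are richer.
public enum NodeKind { Literal, Boundary, NonBoundary, Beginning, End }

public readonly record struct Node(NodeKind Kind, char Ch = '\0');

public static class AnchorCoalescer
{
    // In this toy model, everything except a literal is a zero-width anchor.
    private static bool IsAnchor(NodeKind kind) => kind != NodeKind.Literal;

    // Walks a concatenation and skips any anchor that immediately follows
    // an anchor of the same kind; everything else is copied unchanged.
    public static List<Node> Coalesce(IReadOnlyList<Node> concatenation)
    {
        var result = new List<Node>(concatenation.Count);
        foreach (Node node in concatenation)
        {
            if (result.Count > 0 && IsAnchor(node.Kind) && result[^1].Kind == node.Kind)
            {
                continue; // redundant repeat of the previous anchor
            }
            result.Add(node);
        }
        return result;
    }

    public static void Main()
    {
        // Models the tail of a pattern such as p\b\b.
        var nodes = new[] { new Node(NodeKind.Literal, 'p'), new Node(NodeKind.Boundary), new Node(NodeKind.Boundary) };
        Console.WriteLine(Coalesce(nodes).Count);
    }
}
```

Only identical adjacent anchors are merged here; different anchors next to each other (for example `\b\B`) assert different conditions and are left alone.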
Comments suppressed due to low confidence (1)
src/libraries/System.Text.RegularExpressions/tests/UnitTests/RegexReductionTests.cs:384
- This test case doesn't demonstrate anchor coalescing since it only contains a single `$` anchor. Consider using `[InlineData(@"$$", @"$")]` to test coalescing of multiple end-of-string anchors.
[InlineData(@"$", @"$")]
79 out of 18857 patterns have generated source code changes.
Examples of GeneratedRegex source diffs
"(?<desc>h|ampm|am\\b|a\\.m\\.|a m\\b|a\\. m\ ..." (114 uses)
[GeneratedRegex("(?<desc>h|ampm|am\\b|a\\.m\\.|a m\\b|a\\. m\\.|a\\.m\\b|a\\. m\\b|pm\\b|p\\.m\\.|p m\\b|p\\. m\\.|p\\.m\\b|p\\. m\\b|p\\b\\b)", RegexOptions.IgnoreCase | RegexOptions.Singleline)]
/// ○ Match a sequence of expressions.<br/>
/// ○ Match a character in the set [Pp].<br/>
/// ○ Match if at a word boundary.<br/>
- /// ○ Match if at a word boundary.<br/>
/// </code>
/// </remarks>
[global::System.CodeDom.Compiler.GeneratedCodeAttribute("System.Text.RegularExpressions.Generator", "42.42.42.42")]
return false; // The input didn't match.
}
- // Match if at a word boundary.
- if (!Utilities.IsBoundary(inputSpan, pos + 1))
- {
- UncaptureUntil(0);
- return false; // The input didn't match.
- }
-
pos++;
slice = inputSpan.Slice(pos);
}
"^^(?<AmsNetId>((?<First>\\d{1,3})\\.(?<Secon ..." (50 uses)
[GeneratedRegex("^^(?<AmsNetId>((?<First>\\d{1,3})\\.(?<Second>\\d{1,3})\\.(?<Third>\\d{1,3})\\.(?<Fourth>\\d{1,3})\\.(?<Fifth>\\d{1,3})\\.(?<Sixth>\\d{1,3})) | Local | Empty | LocalHost)(:(?<AdsPort>\\d+))?$$", RegexOptions.IgnoreCase | RegexOptions.IgnorePatternWhitespace | RegexOptions.CultureInvariant)]
/// Explanation:<br/>
/// <code>
/// ○ Match if at the beginning of the string.<br/>
- /// ○ Match if at the beginning of the string.<br/>
/// ○ "AmsNetId" capture group.<br/>
/// ○ Match with 4 alternative expressions.<br/>
/// ○ 1st capture group.<br/>
/// ○ "AdsPort" capture group.<br/>
/// ○ Match a Unicode digit atomically at least once.<br/>
/// ○ Match if at the end of the string or if before an ending newline.<br/>
- /// ○ Match if at the end of the string or if before an ending newline.<br/>
/// </code>
/// </remarks>
[global::System.CodeDom.Compiler.GeneratedCodeAttribute("System.Text.RegularExpressions.Generator", "42.42.42.42")]
return false; // The input didn't match.
}
- // Match if at the beginning of the string.
- if (pos != 0)
- {
- UncaptureUntil(0);
- return false; // The input didn't match.
- }
-
// "AmsNetId" capture group.
//{
capture_starting_pos = pos;
goto LoopIterationNoMatch;
}
- // Match if at the end of the string or if before an ending newline.
- if (pos < inputSpan.Length - 1 || ((uint)pos < (uint)inputSpan.Length && inputSpan[pos] != '\n'))
- {
- goto LoopIterationNoMatch;
- }
-
// The input matched.
base.runtextpos = pos;
base.Capture(0, matchStart, pos);
"\\s+([`~!@#$%^&*\\(\\)=+\\|\\[\\]{};':,.<>?< ..." (26 uses)
[GeneratedRegex("\\s+([`~!@#$%^&*\\(\\)=+\\|\\[\\]{};':,.<>?<《》,。?;‘’“”:、¥!…()])+\\s+")]
}
}
- /// <summary>Supports searching for characters in or not in "\t\n\v\f\r \u0085 \u2028\u2029 ".</summary>
- internal static readonly SearchValues<char> s_whitespace = SearchValues.Create("\t\n\v\f\r \u0085 \u2028\u2029 ");
+ /// <summary>Supports searching for characters in or not in "\t\n\v\f\r \u0085 \u2028\u2029 ".</summary>
+ internal static readonly SearchValues<char> s_whitespace = SearchValues.Create("\t\n\v\f\r \u0085 \u2028\u2029 ");
}
}
"\\s+([`~!@#$%^&*\\(\\)\\-_=+\\\\|\\[\\]{};': ..." (26 uses)
[GeneratedRegex("\\s+([`~!@#$%^&*\\(\\)\\-_=+\\\\|\\[\\]{};':,.<>/?<《》,。?;‘’“”:、—¥!…—()])+\\s+")]
}
}
- /// <summary>Supports searching for characters in or not in "\t\n\v\f\r \u0085 \u2028\u2029 ".</summary>
- internal static readonly SearchValues<char> s_whitespace = SearchValues.Create("\t\n\v\f\r \u0085 \u2028\u2029 ");
+ /// <summary>Supports searching for characters in or not in "\t\n\v\f\r \u0085 \u2028\u2029 ".</summary>
+ internal static readonly SearchValues<char> s_whitespace = SearchValues.Create("\t\n\v\f\r \u0085 \u2028\u2029 ");
}
}
"^\\s+(?<guid>\\{[0-9a-zA-Z]{8}-[0-9a-zA-Z]{4 ..." (26 uses)
[GeneratedRegex("^\\s+(?<guid>\\{[0-9a-zA-Z]{8}-[0-9a-zA-Z]{4}-[0-9a-zA-Z]{4}-[0-9a-zA-Z]{4}-[0-9a-zA-Z]{12}\\})\\s+=\\s+(?<dep>\\{[0-9a-zA-Z]{8}-[0-9a-zA-Z]{4}-[0-9a-zA-Z]{4}-[0-9a-zA-Z]{4}-[0-9a-zA-Z]{12}\\})", RegexOptions.Multiline)]
/// <summary>Whether <see cref="s_defaultTimeout"/> is non-infinite.</summary>
internal static readonly bool s_hasTimeout = s_defaultTimeout != Regex.InfiniteMatchTimeout;
- /// <summary>Supports searching for characters in or not in "\t\n\v\f\r \u0085 \u2028\u2029 ".</summary>
- internal static readonly SearchValues<char> s_whitespace = SearchValues.Create("\t\n\v\f\r \u0085 \u2028\u2029 ");
+ /// <summary>Supports searching for characters in or not in "\t\n\v\f\r \u0085 \u2028\u2029 ".</summary>
+ internal static readonly SearchValues<char> s_whitespace = SearchValues.Create("\t\n\v\f\r \u0085 \u2028\u2029 ");
}
}
"(?<id>\\S+)" (21 uses)
[GeneratedRegex("(?<id>\\S+)")]
/// <summary>Whether <see cref="s_defaultTimeout"/> is non-infinite.</summary>
internal static readonly bool s_hasTimeout = s_defaultTimeout != Regex.InfiniteMatchTimeout;
- /// <summary>Supports searching for characters in or not in "\t\n\v\f\r \u0085 \u2028\u2029 ".</summary>
- internal static readonly SearchValues<char> s_whitespace = SearchValues.Create("\t\n\v\f\r \u0085 \u2028\u2029 ");
+ /// <summary>Supports searching for characters in or not in "\t\n\v\f\r \u0085 \u2028\u2029 ".</summary>
+ internal static readonly SearchValues<char> s_whitespace = SearchValues.Create("\t\n\v\f\r \u0085 \u2028\u2029 ");
}
}
"(?<id>\\S+)(?<version>.*)" (21 uses)
[GeneratedRegex("(?<id>\\S+)(?<version>.*)")]
/// <summary>Whether <see cref="s_defaultTimeout"/> is non-infinite.</summary>
internal static readonly bool s_hasTimeout = s_defaultTimeout != Regex.InfiniteMatchTimeout;
- /// <summary>Supports searching for characters in or not in "\t\n\v\f\r \u0085 \u2028\u2029 ".</summary>
- internal static readonly SearchValues<char> s_whitespace = SearchValues.Create("\t\n\v\f\r \u0085 \u2028\u2029 ");
+ /// <summary>Supports searching for characters in or not in "\t\n\v\f\r \u0085 \u2028\u2029 ".</summary>
+ internal static readonly SearchValues<char> s_whitespace = SearchValues.Create("\t\n\v\f\r \u0085 \u2028\u2029 ");
}
}
"\\s+" (18 uses)
[GeneratedRegex("\\s+", RegexOptions.IgnoreCase | RegexOptions.CultureInvariant)]
/// <summary>Whether <see cref="s_defaultTimeout"/> is non-infinite.</summary>
internal static readonly bool s_hasTimeout = s_defaultTimeout != Regex.InfiniteMatchTimeout;
- /// <summary>Supports searching for characters in or not in "\t\n\v\f\r \u0085 \u2028\u2029 ".</summary>
- internal static readonly SearchValues<char> s_whitespace = SearchValues.Create("\t\n\v\f\r \u0085 \u2028\u2029 ");
+ /// <summary>Supports searching for characters in or not in "\t\n\v\f\r \u0085 \u2028\u2029 ".</summary>
+ internal static readonly SearchValues<char> s_whitespace = SearchValues.Create("\t\n\v\f\r \u0085 \u2028\u2029 ");
}
}
"\\s\\s+" (17 uses)
[GeneratedRegex("\\s\\s+")]
/// <summary>Whether <see cref="s_defaultTimeout"/> is non-infinite.</summary>
internal static readonly bool s_hasTimeout = s_defaultTimeout != Regex.InfiniteMatchTimeout;
- /// <summary>Supports searching for characters in or not in "\t\n\v\f\r \u0085 \u2028\u2029 ".</summary>
- internal static readonly SearchValues<char> s_whitespace = SearchValues.Create("\t\n\v\f\r \u0085 \u2028\u2029 ");
+ /// <summary>Supports searching for characters in or not in "\t\n\v\f\r \u0085 \u2028\u2029 ".</summary>
+ internal static readonly SearchValues<char> s_whitespace = SearchValues.Create("\t\n\v\f\r \u0085 \u2028\u2029 ");
}
}
"(?<desc>h|ampm|am\\b|a\\.m\\.|a m\\b|a\\. m\ ..." (16 uses)
[GeneratedRegex("(?<desc>h|ampm|am\\b|a\\.m\\.|a m\\b|a\\. m\\.|a\\.m\\b|a\\. m\\b|pm\\b|p\\.m\\.|p m\\b|p\\. m\\.|p\\.m\\b|p\\. m\\b|p\\b\\b)", RegexOptions.Singleline)]
/// ○ Match a sequence of expressions.<br/>
/// ○ Match the string " m".<br/>
/// ○ Match if at a word boundary.<br/>
- /// ○ Match a sequence of expressions.<br/>
- /// ○ Match if at a word boundary.<br/>
- /// ○ Match if at a word boundary.<br/>
+ /// ○ Match if at a word boundary.<br/>
/// </code>
/// </remarks>
[global::System.CodeDom.Compiler.GeneratedCodeAttribute("System.Text.RegularExpressions.Generator", "42.42.42.42")]
return false; // The input didn't match.
}
- // Match if at a word boundary.
- if (!Utilities.IsBoundary(inputSpan, pos + 1))
- {
- UncaptureUntil(0);
- return false; // The input didn't match.
- }
-
pos++;
slice = inputSpan.Slice(pos);
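}
For more diff examples, see https://gist.github.com/MihuBot/2273def877179c54bafb17a692e2f31d
For a list of JIT diff improvements, see Improvements.md
Sample source code for further analysis
// Namespaces the sample relies on (some may already be covered by implicit usings).
using System;
using System.IO;
using System.IO.Compression;
using System.Linq;
using System.Net.Http;
using System.Text.Json;
using System.Text.RegularExpressions;

const string JsonPath = "RegexResults-1284.json";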
if (!File.Exists(JsonPath))
{
await using var archiveStream = await new HttpClient().GetStreamAsync("https://mihubot.xyz/r/E2jppW7A");
using var archive = new ZipArchive(archiveStream, ZipArchiveMode.Read);
archive.Entries.First(e => e.Name == "Results.json").ExtractToFile(JsonPath);
}
using FileStream jsonFileStream = File.OpenRead(JsonPath);
RegexEntry[] entries = JsonSerializer.Deserialize<RegexEntry[]>(jsonFileStream, new JsonSerializerOptions { IncludeFields = true })!;
Console.WriteLine($"Working with {entries.Length} patterns");
record KnownPattern(string Pattern, RegexOptions Options, int Count);
sealed class RegexEntry
{
public required KnownPattern Regex { get; set; }
public required string MainSource { get; set; }
public required string PrSource { get; set; }
public string? FullDiff { get; set; }
public string? ShortDiff { get; set; }
public (string Name, string Values)[]? SearchValuesOfChar { get; set; }
public (string[] Values, StringComparison ComparisonType)[]? SearchValuesOfString { get; set; }
}
@MihaZupan, in your bot's results, what's changing in some of these patterns, e.g. "\s\s+"?
Looks like one of the spaces is changing for some reason. internal static readonly SearchValues<char> s_whitespace = SearchValues.Create("\t\n\v\f\r \u0085 \u2028\u2029 ");
Hopefully. I'm a little worried your tooling is correct and there's some sort of issue in the generator.
Something weird is going on here. I've added test logic into the generator: MihaZupan@9389bed
With chars:
note the
Oh, actually it makes sense: we're using the same
Line 1023 in 197f38a
so with multithreaded access the Sort is corrupting the array.
|
OK, so a real bug. Nice job tracking it down.
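The shared object and the eventual fix are not visible in this thread, so the following is only a general sketch of the kind of bug described: an in-place `Array.Sort` on a buffer that several threads can reach. `SharedSortDemo` and both method names are hypothetical. One common remedy, assumed here rather than taken from the PR, is to sort a private copy instead of the shared array.

```csharp
using System;
using System.Threading.Tasks;

public static class SharedSortDemo
{
    // Risky shape: sorts the caller's array in place. If the same array is
    // reachable from several threads, concurrent sorts can interleave their
    // writes and leave the contents corrupted.
    public static string DescribeInPlace(char[] shared)
    {
        Array.Sort(shared);
        return new string(shared);
    }

    // Safer shape: each call sorts its own copy, so the shared array is only read.
    public static string DescribeWithCopy(char[] shared)
    {
        char[] copy = (char[])shared.Clone();
        Array.Sort(copy);
        return new string(copy);
    }

    public static void Main()
    {
        // Illustrative data only; not the generator's actual character set.
        char[] shared = "zyxwvutsrqponmlkjihgfedcba".ToCharArray();
        string expected = DescribeWithCopy(shared);

        // Many concurrent callers, all reading the same array without mutating it.
        Parallel.For(0, 1000, _ =>
        {
            if (DescribeWithCopy(shared) != expected)
            {
                throw new InvalidOperationException("Unexpected result from copy-based sort.");
            }
        });

        Console.WriteLine("All concurrent calls agreed.");
    }
}
```

Copying costs an allocation per call; per-thread buffers or a lock around the sort are alternative designs if that matters.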