Ensure that bitwise operations of small masks can still be folded #117887
Conversation
Pull Request Overview
This PR fixes an issue in the JIT compiler where bitwise operations on SIMD mask values with small element types were losing their type information during folding optimizations. The change ensures that when folding hardware intrinsic expressions involving convert operations, the original small mask type is preserved rather than being normalized to a larger integer type. A hedged sketch of the kind of pattern this affects follows the list of key changes below.
Key changes:
- Modified type size comparison logic to compare operand types directly instead of against a common base type
- Added preservation of the original SIMD base type from convert operations to maintain mask element count information
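As a rough illustration only (not code from this PR), the shape below is the kind of pattern such folding targets: two comparisons over small (byte) elements each produce a mask, and the bitwise OR between them should be foldable so it can stay in mask (kmask) form rather than round-tripping through vector form. The method and names are hypothetical; it assumes the .NET 8+ Vector512 APIs and AVX-512-capable hardware.
using System;
using System.Runtime.Intrinsics;
static class MaskFoldSketch
{
    // Checks whether any byte in the first 64 bytes equals either needle value.
    // On AVX-512, each Equals produces a mask of byte elements; the '|' between the
    // two masks is the kind of bitwise operation on small masks that should be folded
    // so it stays in mask form instead of converting to a vector and back.
    static bool ContainsEither(ReadOnlySpan<byte> data, byte x, byte y)
    {
        if (!Vector512.IsHardwareAccelerated || data.Length < Vector512<byte>.Count)
            return false;
        Vector512<byte> chunk = Vector512.Create(data[..Vector512<byte>.Count]);
        Vector512<byte> match = Vector512.Equals(chunk, Vector512.Create(x))
                              | Vector512.Equals(chunk, Vector512.Create(y));
        return match != Vector512<byte>.Zero;
    }
}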
Tagging subscribers to this area: @JulieLeeMSFT, @jakobbotsch
@EgorBot -amd
using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;
using System.Buffers;
BenchmarkRunner.Run<SingleString>(args: args);
public class SingleString
{
private static readonly SearchValues<string> s_values = SearchValues.Create([Needle], StringComparison.Ordinal);
private static readonly SearchValues<string> s_valuesIC = SearchValues.Create([Needle], StringComparison.OrdinalIgnoreCase);
private static readonly string s_text_noMatches = new('a', Length);
private static readonly string s_text_falsePositives = string.Concat(Enumerable.Repeat("Sherlock Holm_s", Length / Needle.Length));
public const int Length = 100_000;
public const string Needle = "Sherlock Holmes";
[Benchmark] public void Throughput() => s_text_noMatches.AsSpan().Contains(Needle, StringComparison.Ordinal);
[Benchmark] public void SV_Throughput() => s_text_noMatches.AsSpan().ContainsAny(s_values);
[Benchmark] public void SV_ThroughputIC() => s_text_noMatches.AsSpan().ContainsAny(s_valuesIC);
[Benchmark] public void FalsePositives() => s_text_falsePositives.AsSpan().Contains(Needle, StringComparison.Ordinal);
[Benchmark] public void SV_FalsePositives() => s_text_falsePositives.AsSpan().ContainsAny(s_values);
[Benchmark] public void SV_FalsePositivesIC() => s_text_falsePositives.AsSpan().ContainsAny(s_valuesIC);
}
This appears to be regressing that path currently: EgorBot/runtime-utils#445 (comment)
I'm not seeing the same locally on my Intel or AMD boxes. What I am seeing is that these tests touch enough memory that they are impacted fairly significantly by data alignment and can vary by a decent amount across several separate runs. The work BDN does to account for noise doesn't handle things like static readonly values, which get reused across multiple iterations without being moved and end up impacting the cache.
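As a hedged sketch of that point (not code from this thread), one way to probe the alignment sensitivity is to build the haystack and SearchValues in a setup method instead of static readonly fields, so their placement is not fixed across runs; the class and field names here are illustrative.
using System;
using System.Buffers;
using System.Linq;
using BenchmarkDotNet.Attributes;
public class SingleStringFreshAlloc
{
    public const int Length = 100_000;
    public const string Needle = "Sherlock Holmes";
    private SearchValues<string> _values = null!;
    private string _text_falsePositives = null!;
    [GlobalSetup]
    public void Setup()
    {
        // Allocated per benchmark process rather than held in statics, so the data's
        // address (and therefore alignment/cache behavior) can differ between runs.
        _values = SearchValues.Create([Needle], StringComparison.Ordinal);
        _text_falsePositives = string.Concat(Enumerable.Repeat("Sherlock Holm_s", Length / Needle.Length));
    }
    [Benchmark] public bool SV_FalsePositives() => _text_falsePositives.AsSpan().ContainsAny(_values);
}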
For example, I get the following numbers on my 7900X:
Before
After
Different boxes report different numbers, but all are generally in line with the change being the same or faster, with smaller average codegen. The 7900X remains slightly pessimized for V512-related mask usage due to its double pumping, which makes the mask<->vector conversions a bit more visible.
Here are 7 consecutive runs from my 9950X consistently showing a similar regression to the one reported by the bot for the Epyc 9V74. Or another run with
I'd speculate there's something else at play in the microbenchmarks, then. I still cannot reproduce this locally (Tiger Lake, Cascade Lake, Ice Lake, or Zen4), and this is a pure reduction in terms of uops, instructions, and code size. It follows the recommendations from the Intel/AMD optimization manuals and more closely matches what GCC/Clang/MSVC produce for similar code.
What's still lacking is the handling of
@EgorBot -aws_sapphirelake -azure_cascadelake
using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;
using System.Buffers;
BenchmarkRunner.Run<SingleString>(args: args);
public class SingleString
{
private static readonly SearchValues<string> s_values = SearchValues.Create([Needle], StringComparison.Ordinal);
private static readonly SearchValues<string> s_valuesIC = SearchValues.Create([Needle], StringComparison.OrdinalIgnoreCase);
private static readonly string s_text_noMatches = new('a', Length);
private static readonly string s_text_falsePositives = string.Concat(Enumerable.Repeat("Sherlock Holm_s", Length / Needle.Length));
public const int Length = 100_000;
public const string Needle = "Sherlock Holmes";
[Benchmark] public void Throughput() => s_text_noMatches.AsSpan().Contains(Needle, StringComparison.Ordinal);
[Benchmark] public void SV_Throughput() => s_text_noMatches.AsSpan().ContainsAny(s_values);
[Benchmark] public void SV_ThroughputIC() => s_text_noMatches.AsSpan().ContainsAny(s_valuesIC);
[Benchmark] public void FalsePositives() => s_text_falsePositives.AsSpan().Contains(Needle, StringComparison.Ordinal);
[Benchmark] public void SV_FalsePositives() => s_text_falsePositives.AsSpan().ContainsAny(s_values);
[Benchmark] public void SV_FalsePositivesIC() => s_text_falsePositives.AsSpan().ContainsAny(s_valuesIC);
}
@EgorBo the numbers for Sapphire Rapids/Cascade Lake look like what I saw, and like what I see on my Zen4 and Tiger Lake boxes (minus the outlier you have for …).
I think this is still something we should take. It is in line with the intended optimization this path should already have been doing: it is an instruction count reduction, a micro-op reduction, a latency reduction (as computed by LLVM MCA and uiCA), and a reduction in assembly size, and it matches the general AVX-512 optimization guidelines about avoiding unnecessary conversions between mask and vector form. Even if there is a regression on Zen5, it would be micro-architecture specific and something that should get addressed over time, particularly as we finish the other remaining optimization work to ensure we consume the kmask directly in the subsequent branch tests (i.e. we generate
For reference, uiCA reports this for the prior code:
While for the new code it reports:
If we also resolve the
|
This resolves an issue spotted in #117865 (comment)