Ensure that bitwise operations of small masks can still be folded #117887

tannergooding · 2025-07-21T15:55:09Z

This resolves an issue spotted in #117865 (comment)

Copilot

Pull Request Overview

This PR fixes an issue in the JIT compiler where bitwise operations on SIMD mask values with small element types were losing their type information during folding optimizations. The change ensures that when folding hardware intrinsic expressions involving convert operations, the original small mask type is preserved rather than being normalized to a larger integer type.

Key changes:

Modified type size comparison logic to compare operand types directly instead of against a common base type
Added preservation of the original SIMD base type from convert operations to maintain mask element count information
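
To make the pattern concrete, here is a minimal C# sketch of the kind of source this folding targets: several 16-bit element comparisons whose results are ANDed together before a single test. The class and method names are hypothetical and not taken from the PR. Whether the JIT keeps the intermediate results in mask registers for this exact shape depends on the runtime build and on AVX-512 being available, so treat it as an illustration only.

```csharp
using System;
using System.Runtime.Intrinsics;

public static class SmallMaskFoldSketch
{
    // Hypothetical helper: three comparisons over 16-bit elements, combined
    // with bitwise AND, followed by one "is any lane set" test.
    public static bool AllThreeMatchAnywhere(
        Vector512<ushort> a0, Vector512<ushort> b0,
        Vector512<ushort> a1, Vector512<ushort> b1,
        Vector512<ushort> a2, Vector512<ushort> b2)
    {
        Vector512<ushort> m0 = Vector512.Equals(a0, b0);
        Vector512<ushort> m1 = Vector512.Equals(a1, b1);
        Vector512<ushort> m2 = Vector512.Equals(a2, b2);

        // The AND of comparison results is the "bitwise operation of small
        // masks" in the PR title (ushort is a small element type).
        return (m0 & m1 & m2) != Vector512<ushort>.Zero;
    }
}
```

The Vector512 APIs fall back to software when the hardware lacks AVX-512, so the sketch runs anywhere; only the generated code differs.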

dotnet-policy-service · 2025-07-21T15:55:58Z

Tagging subscribers to this area: @JulieLeeMSFT, @jakobbotsch
See info in area-owners.md if you want to be subscribed.

MihaZupan · 2025-07-21T16:11:31Z

@EgorBot -amd

using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;
using System;
using System.Buffers;
using System.Linq;

BenchmarkRunner.Run<SingleString>(args: args);

public class SingleString
{
    private static readonly SearchValues<string> s_values = SearchValues.Create([Needle], StringComparison.Ordinal);
    private static readonly SearchValues<string> s_valuesIC = SearchValues.Create([Needle], StringComparison.OrdinalIgnoreCase);
    private static readonly string s_text_noMatches = new('a', Length);
    private static readonly string s_text_falsePositives = string.Concat(Enumerable.Repeat("Sherlock Holm_s", Length / Needle.Length));

    public const int Length = 100_000;
    public const string Needle = "Sherlock Holmes";

    [Benchmark] public void Throughput() => s_text_noMatches.AsSpan().Contains(Needle, StringComparison.Ordinal);
    [Benchmark] public void SV_Throughput() => s_text_noMatches.AsSpan().ContainsAny(s_values);
    [Benchmark] public void SV_ThroughputIC() => s_text_noMatches.AsSpan().ContainsAny(s_valuesIC);

    [Benchmark] public void FalsePositives() => s_text_falsePositives.AsSpan().Contains(Needle, StringComparison.Ordinal);
    [Benchmark] public void SV_FalsePositives() => s_text_falsePositives.AsSpan().ContainsAny(s_values);
    [Benchmark] public void SV_FalsePositivesIC() => s_text_falsePositives.AsSpan().ContainsAny(s_valuesIC);
}

MihaZupan · 2025-07-21T16:43:18Z

This appears to be regressing that path currently: EgorBot/runtime-utils#445 (comment)

tannergooding · 2025-07-21T17:11:57Z

I'm not seeing the same locally on my Intel or AMD boxes.

What I am seeing is that these tests touch enough memory that they are impacted fairly significantly by data alignment and can vary by a decent amount across several separate runs. The work that BDN does to account for noise doesn't handle things like static readonly values, which get reused across multiple iterations without being moved and so end up impacting the cache.

tannergooding · 2025-07-21T17:18:11Z

For example, I get the following numbers on my 7900X

Before

Method	Mean	Error	StdDev	Median	Min	Max	Allocated
Throughput	3.174 us	0.0816 us	0.0940 us	3.135 us	3.063 us	3.315 us	-
SV_Throughput	3.967 us	0.0220 us	0.0172 us	3.964 us	3.947 us	4.004 us	-
SV_ThroughputIC	4.519 us	0.0854 us	0.0799 us	4.549 us	4.431 us	4.614 us	-
FalsePositives	11.598 us	0.1648 us	0.1542 us	11.598 us	11.383 us	11.819 us	-
SV_FalsePositives	9.393 us	0.1852 us	0.1819 us	9.468 us	9.058 us	9.696 us	-
SV_FalsePositivesIC	10.724 us	0.1985 us	0.1857 us	10.705 us	10.507 us	10.984 us	-

After

Method	Mean	Error	StdDev	Median	Min	Max	Allocated
Throughput	3.042 us	0.0342 us	0.0303 us	3.032 us	3.005 us	3.105 us	-
SV_Throughput	3.956 us	0.0058 us	0.0045 us	3.956 us	3.947 us	3.962 us	-
SV_ThroughputIC	4.647 us	0.0276 us	0.0215 us	4.651 us	4.611 us	4.682 us	-
FalsePositives	11.619 us	0.2319 us	0.2169 us	11.543 us	11.327 us	12.075 us	-
SV_FalsePositives	9.513 us	0.1498 us	0.1328 us	9.518 us	9.356 us	9.758 us	-
SV_FalsePositivesIC	10.930 us	0.0526 us	0.0439 us	10.928 us	10.874 us	11.026 us	-

Different boxes report different numbers, but all are generally in line with the change being the same or faster and with smaller average codegen. The 7900X remains slightly pessimized for V512-related mask usage due to its double pumping, which makes the mask<->vector conversions a bit more visible.
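
As a reading aid for the two tables above, the following sketch computes the relative change of each "After" mean against its "Before" mean. The means are copied from the tables; the class and method names are hypothetical. It ignores the Error and StdDev columns, so it only summarizes the means and says nothing about significance.

```csharp
using System;
using System.Collections.Generic;

public static class MeanComparison
{
    // Mean values in microseconds, copied from the 7900X tables above.
    public static readonly (string Name, double Before, double After)[] Rows =
    {
        ("Throughput",          3.174,  3.042),
        ("SV_Throughput",       3.967,  3.956),
        ("SV_ThroughputIC",     4.519,  4.647),
        ("FalsePositives",     11.598, 11.619),
        ("SV_FalsePositives",   9.393,  9.513),
        ("SV_FalsePositivesIC", 10.724, 10.930),
    };

    // Negative means the "After" run was faster.
    public static double PercentChange(double before, double after) =>
        (after - before) / before * 100.0;

    public static IEnumerable<string> Describe()
    {
        foreach (var (name, before, after) in Rows)
            yield return $"{name}: {PercentChange(before, after):+0.00;-0.00}%";
    }
}
```

Printing `MeanComparison.Describe()` line by line shows changes ranging from about -4.2% (Throughput) to about +2.8% (SV_ThroughputIC) on these means.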

MihaZupan · 2025-07-21T18:06:37Z

> can vary by a decent amount across several separate runs

Here are 7 consecutive runs from my 9950x consistently showing a similar regression as reported by the bot for the Epyc 9V74.
https://gist.github.com/MihaZupan/9cfd8e479bb95e8cd174e9704d6731ab

Or another run with LongJob (so multiple restarts) on the same Zen 4 Epyc: MihuBot/runtime-utils#1252

tannergooding · 2025-07-21T18:48:19Z

> Here are 7 consecutive runs from my 9950x consistently showing a similar regression as reported by the bot for the Epyc 9V74.

I'd speculate there's something else at play here in the microbenchmarks then.

I still cannot reproduce locally (Tiger Lake, Cascade Lake, Ice Lake, or Zen4) and this is a pure reduction in terms of uops, instructions, and code size. It follows the recommendations from the Intel/AMD optimization manuals and more closely matches what GCC/Clang/MSVC produce for similar code.

tannergooding · 2025-07-21T19:24:21Z

> more closely matches what GCC/Clang/MSVC produce for similar code.

What's still lacking is the handling of vpmovm2*; vptest and similar sequences, which is a more involved change, as indicated above.

EgorBo · 2025-07-21T19:50:26Z

@EgorBot -aws_sapphirelake -azure_cascadelake

using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;
using System.Buffers;

BenchmarkRunner.Run<SingleString>(args: args);

public class SingleString
{
    private static readonly SearchValues<string> s_values = SearchValues.Create([Needle], StringComparison.Ordinal);
    private static readonly SearchValues<string> s_valuesIC = SearchValues.Create([Needle], StringComparison.OrdinalIgnoreCase);
    private static readonly string s_text_noMatches = new('a', Length);
    private static readonly string s_text_falsePositives = string.Concat(Enumerable.Repeat("Sherlock Holm_s", Length / Needle.Length));

    public const int Length = 100_000;
    public const string Needle = "Sherlock Holmes";

    [Benchmark] public void Throughput() => s_text_noMatches.AsSpan().Contains(Needle, StringComparison.Ordinal);
    [Benchmark] public void SV_Throughput() => s_text_noMatches.AsSpan().ContainsAny(s_values);
    [Benchmark] public void SV_ThroughputIC() => s_text_noMatches.AsSpan().ContainsAny(s_valuesIC);

    [Benchmark] public void FalsePositives() => s_text_falsePositives.AsSpan().Contains(Needle, StringComparison.Ordinal);
    [Benchmark] public void SV_FalsePositives() => s_text_falsePositives.AsSpan().ContainsAny(s_values);
    [Benchmark] public void SV_FalsePositivesIC() => s_text_falsePositives.AsSpan().ContainsAny(s_valuesIC);
}

tannergooding · 2025-07-22T15:42:45Z

@EgorBo numbers for Sapphire Rapids/Cascade Lake look like what I saw, and what I see on my Zen4 and Tiger Lake boxes (minus the outlier you have for FalsePositives on Cascade Lake).

I think this is still something we should take. This is in line with the optimization the path was already intended to be doing. It is an instruction count reduction, a micro-op reduction, a latency reduction (as computed by LLVM MCA and uiCA), and a reduction in assembly size, and it matches the general AVX512 optimization guidelines to avoid unnecessary conversions between mask and vector form.

Even if there is a regression on Zen5, this would be microarchitecture-specific and something that should get addressed over time, particularly as we finish the other remaining optimization work to ensure we consume the kmask directly in the subsequent branch tests (i.e. we generate kortest directly, rather than vpmovm2*; vptest).
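
The reason the mask can be consumed directly is that ANDing and testing in mask form gives the same answer as expanding each mask to a vector first. The following is a scalar model of that equivalence for 32 lanes of 16-bit elements. It is not JIT code; the class and method names are hypothetical, and the comments only say which instruction each step is meant to model.

```csharp
using System;

public static class MaskModel
{
    public const int Lanes = 32; // 512 bits / 16-bit elements

    // Models vpmovm2w: each mask bit becomes an all-ones or all-zeros lane.
    public static ushort[] MaskToVector(uint k)
    {
        var v = new ushort[Lanes];
        for (int i = 0; i < Lanes; i++)
            v[i] = ((k >> i) & 1) != 0 ? (ushort)0xFFFF : (ushort)0;
        return v;
    }

    // Vector-form path: expand both masks, AND lane-wise, test for any set lane.
    public static bool AnySetViaVectors(uint k1, uint k2)
    {
        ushort[] a = MaskToVector(k1), b = MaskToVector(k2);
        for (int i = 0; i < Lanes; i++)
            if ((a[i] & b[i]) != 0) return true;
        return false;
    }

    // Mask-form path: models kandd followed by a kortest-style zero check.
    public static bool AnySetViaMasks(uint k1, uint k2) => (k1 & k2) != 0;
}
```

For any pair of inputs, `AnySetViaVectors` and `AnySetViaMasks` agree, which is why the conversions in between are pure overhead.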

tannergooding · 2025-07-22T16:07:19Z

For reference, uiCA reports this for the prior code:

Throughput (in cycles per iteration): 4.62
Bottleneck: Scheduling

The following throughputs could be achieved if the given property were the only bottleneck:

  - Predecoder: 3.50
  - Decoder: 4.00
  - Issue: 2.40
  - Ports: 4.50

┌───────────────────────┬────────┬───────┬─────────────────────────────────────────────────────────────────────────────────────────┐
│ MITE   MS   DSB   LSD │ Issued │ Exec. │ Port 0   Port 1   Port 2   Port 3   Port 4   Port 5   Port 6   Port 7   Port 8   Port 9 │
├───────────────────────┼────────┼───────┼─────────────────────────────────────────────────────────────────────────────────────────┤
│  2                    │   2    │   2   │                    0.51     0.49               1                                        │ [vpcmpeqw k1, zmm6, zmmword ptr [rsi]](https://www.uops.info/html-instr/VPCMPEQW_K_ZMM_M512.html)
│  1                    │   1    │   1   │  0.98                                         0.02                                      │ [vpmovm2w zmm0, k1](https://www.uops.info/html-instr/VPMOVM2W_ZMM_K.html)
│  2                    │   2    │   2   │                    0.49     0.51               1                                        │ [vpcmpeqw k1, zmm7, zmmword ptr [rsi+r14*1]](https://www.uops.info/html-instr/VPCMPEQW_K_ZMM_M512.html)
│  1                    │   1    │   1   │  0.96                                         0.04                                      │ [vpmovm2w zmm1, k1](https://www.uops.info/html-instr/VPMOVM2W_ZMM_K.html)
│  2                    │   2    │   2   │                    0.51     0.49               1                                        │ [vpcmpeqw k1, zmm8, zmmword ptr [rsi+r15*1]](https://www.uops.info/html-instr/VPCMPEQW_K_ZMM_M512.html)
│  1                    │   1    │   1   │   1                                                                                     │ [vpmovm2w zmm2, k1](https://www.uops.info/html-instr/VPMOVM2W_ZMM_K.html)
│  1                    │   1    │   1   │  0.39                                         0.61                                      │ [vpternlogd zmm2, zmm1, zmm0, 0x80](https://www.uops.info/html-instr/VPTERNLOGD_ZMM_ZMM_ZMM_I8.html)
│  1                    │   1    │   1   │                                                1                                        │ [vptestmb k1, zmm2, zmm2](https://www.uops.info/html-instr/VPTESTMB_K_ZMM_ZMM.html)
│  1                    │   1    │   1   │   1                                                                                     │ [kortestq k1, k1](https://www.uops.info/html-instr/KORTESTQ_K_K.html)
├───────────────────────┼────────┼───────┼─────────────────────────────────────────────────────────────────────────────────────────┤
│  12                   │   12   │  12   │  4.33              1.51     1.49              4.67                                      │ Total
└───────────────────────┴────────┴───────┴─────────────────────────────────────────────────────────────────────────────────────────┘

While the new code reports:

Throughput (in cycles per iteration): 4.30
Bottleneck: Scheduling

The following throughputs could be achieved if the given property were the only bottleneck:

  - Predecoder: 2.94
  - Decoder: 4.00
  - Issue: 2.20
  - Ports: 4.00

┌───────────────────────┬────────┬───────┬─────────────────────────────────────────────────────────────────────────────────────────┐
│ MITE   MS   DSB   LSD │ Issued │ Exec. │ Port 0   Port 1   Port 2   Port 3   Port 4   Port 5   Port 6   Port 7   Port 8   Port 9 │
├───────────────────────┼────────┼───────┼─────────────────────────────────────────────────────────────────────────────────────────┤
│  2                    │   2    │   2   │                    0.5      0.5                1                                        │ [vpcmpeqw k1, zmm6, zmmword ptr [rsi]](https://www.uops.info/html-instr/VPCMPEQW_K_ZMM_M512.html)
│  2                    │   2    │   2   │                    0.5      0.5                1                                        │ [vpcmpeqw k2, zmm7, zmmword ptr [rsi+r14*1]](https://www.uops.info/html-instr/VPCMPEQW_K_ZMM_M512.html)
│  1                    │   1    │   1   │   1                                                                                     │ [kandd k1, k1, k2](https://www.uops.info/html-instr/KANDD_K_K_K.html)
│  2                    │   2    │   2   │                    0.5      0.5                1                                        │ [vpcmpeqw k2, zmm8, zmmword ptr [rsi+r15*1]](https://www.uops.info/html-instr/VPCMPEQW_K_ZMM_M512.html)
│  1                    │   1    │   1   │   1                                                                                     │ [kandd k1, k1, k2](https://www.uops.info/html-instr/KANDD_K_K_K.html)
│  1                    │   1    │   1   │  0.7                                          0.3                                       │ [vpmovm2w zmm0, k1](https://www.uops.info/html-instr/VPMOVM2W_ZMM_K.html)
│  1                    │   1    │   1   │                                                1                                        │ [vptestmb k1, zmm0, zmm0](https://www.uops.info/html-instr/VPTESTMB_K_ZMM_ZMM.html)
│  1                    │   1    │   1   │   1                                                                                     │ [kortestq k1, k1](https://www.uops.info/html-instr/KORTESTQ_K_K.html)
├───────────────────────┼────────┼───────┼─────────────────────────────────────────────────────────────────────────────────────────┤
│  11                   │   11   │  11   │  3.7               1.5      1.5               4.3                                       │ Total
└───────────────────────┴────────┴───────┴─────────────────────────────────────────────────────────────────────────────────────────┘

If we also resolve the vpmovm2, vptestm, kortest sequence to just kortest, we would get:

Throughput (in cycles per iteration): 3.00
Bottlenecks: Decoder, Ports

The following throughputs could be achieved if the given property were the only bottleneck:

  - Predecoder: 2.19
  - Decoder: 3.00
  - Issue: 1.80
  - Ports: 3.00

┌───────────────────────┬────────┬───────┬─────────────────────────────────────────────────────────────────────────────────────────┐
│ MITE   MS   DSB   LSD │ Issued │ Exec. │ Port 0   Port 1   Port 2   Port 3   Port 4   Port 5   Port 6   Port 7   Port 8   Port 9 │
├───────────────────────┼────────┼───────┼─────────────────────────────────────────────────────────────────────────────────────────┤
│  2                    │   2    │   2   │                    0.5      0.5                1                                        │ [vpcmpeqw k1, zmm6, zmmword ptr [rsi]](https://www.uops.info/html-instr/VPCMPEQW_K_ZMM_M512.html)
│  2                    │   2    │   2   │                    0.5      0.5                1                                        │ [vpcmpeqw k2, zmm7, zmmword ptr [rsi+r14*1]](https://www.uops.info/html-instr/VPCMPEQW_K_ZMM_M512.html)
│  1                    │   1    │   1   │   1                                                                                     │ [kandd k1, k1, k2](https://www.uops.info/html-instr/KANDD_K_K_K.html)
│  2                    │   2    │   2   │                    0.5      0.5                1                                        │ [vpcmpeqw k2, zmm8, zmmword ptr [rsi+r15*1]](https://www.uops.info/html-instr/VPCMPEQW_K_ZMM_M512.html)
│  1                    │   1    │   1   │   1                                                                                     │ [kandd k1, k1, k2](https://www.uops.info/html-instr/KANDD_K_K_K.html)
│  1                    │   1    │   1   │   1                                                                                     │ [kortestq k1, k1](https://www.uops.info/html-instr/KORTESTQ_K_K.html)
├───────────────────────┼────────┼───────┼─────────────────────────────────────────────────────────────────────────────────────────┤
│  9                    │   9    │   9   │   3                1.5      1.5                3                                        │ Total
└───────────────────────┴────────┴───────┴─────────────────────────────────────────────────────────────────────────────────────────┘
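
The "Issue" and "Ports" bounds in the three reports can be reproduced from the table totals with simple arithmetic. The sketch below assumes an issue width of 5 uops per cycle and assumes the port bound is the combined port 0 and port 5 load spread over those two ports. Both assumptions were chosen only because they match the reported figures; they are not a description of how uiCA computes them, and the names are hypothetical.

```csharp
using System;

public static class UicaBounds
{
    // Assumption: 5 uops issued per cycle (12 / 2.40 = 11 / 2.20 = 9 / 1.80 = 5).
    public const double AssumedIssueWidth = 5.0;

    // Cycles per iteration if issuing uops were the only limit.
    public static double IssueBound(int totalUops) => totalUops / AssumedIssueWidth;

    // Assumption: the contended uops can go to either port 0 or port 5,
    // so the bound is their combined load divided by two.
    public static double TwoPortBound(double port0Load, double port5Load) =>
        (port0Load + port5Load) / 2.0;
}
```

With the totals above, `IssueBound(12)`, `IssueBound(11)` and `IssueBound(9)` give 2.4, 2.2 and 1.8, and `TwoPortBound(4.33, 4.67)`, `TwoPortBound(3.7, 4.3)` and `TwoPortBound(3, 3)` give 4.5, 4.0 and 3.0, matching the three reports.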

Ensure that bitwise operations of small masks can still be folded

ab633b2

Copilot AI review requested due to automatic review settings July 21, 2025 15:55

Copilot AI reviewed Jul 21, 2025

github-actions bot added the area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI label Jul 21, 2025

dotnet-policy-service bot assigned tannergooding Jul 21, 2025

tannergooding mentioned this pull request Jul 21, 2025

Delete AVX512 paths from IndexOf(string) #117865

Closed

EgorBot mentioned this pull request Jul 21, 2025

Benchmarks for #117887 (MihaZupan) EgorBot/runtime-utils#445

Open

Ensure that the ability to negate a comparison isn't regressed

dd3e538

EgorBot mentioned this pull request Jul 21, 2025

Benchmarks for #117887 (EgorBo) EgorBot/runtime-utils#446

Open

Merge branch 'main' into small-mask-fold

f74508d

EgorBo approved these changes Jul 22, 2025

tannergooding merged commit 7132299 into dotnet:main Jul 22, 2025
108 of 110 checks passed

tannergooding deleted the small-mask-fold branch July 22, 2025 18:53

build-analysis bot mentioned this pull request Jul 22, 2025

System.Diagnostics.Tests.ProcessTests.TestCheckChildProcessUserAndGroupIds fails on Alpine jobs with "Operation not permitted" #117811

Closed

LoopedBard3 mentioned this pull request Jul 29, 2025

[Perf] Windows/x64: 17 Regressions on 7/22/2025 8:52:00 PM +00:00 #118176

Open

github-actions bot locked and limited conversation to collaborators Aug 22, 2025
