@tannergooding commented Jul 21, 2025

This resolves an issue spotted in #117865 (comment)

@Copilot AI review requested due to automatic review settings July 21, 2025 15:55
Contributor

@Copilot AI left a comment

Pull Request Overview

This PR fixes an issue in the JIT compiler where bitwise operations on SIMD mask values with small element types were losing their type information during folding optimizations. The change ensures that when folding hardware intrinsic expressions involving convert operations, the original small mask type is preserved rather than being normalized to a larger integer type.

Key changes:

  • Modified type size comparison logic to compare operand types directly instead of against a common base type
  • Added preservation of the original SIMD base type from convert operations to maintain mask element count information
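
For illustration, the kind of source pattern this fold applies to can be sketched in C# as follows. This is a minimal sketch and not code from the PR: the names `MaskFoldSketch` and `AnyLaneMatchesAll` are hypothetical, and the shape (three `Vector512<ushort>` comparisons combined with `&`, then tested against zero) is assumed from the disassembly quoted later in this conversation. On AVX-512 hardware each comparison can produce a k-mask, and keeping the small mask type is what would let the `&` stay in mask form. The code itself runs on any hardware, since `Vector512` has a software fallback.

```csharp
using System;
using System.Runtime.Intrinsics;

public static class MaskFoldSketch
{
    // Hypothetical helper: true when at least one lane equals its probe
    // value in all three inputs at the same lane index.
    public static bool AnyLaneMatchesAll(
        Vector512<ushort> input0, Vector512<ushort> input1, Vector512<ushort> input2,
        Vector512<ushort> probe0, Vector512<ushort> probe1, Vector512<ushort> probe2)
    {
        // Each Equals yields all-ones lanes on a match and zero lanes otherwise.
        Vector512<ushort> combined =
            Vector512.Equals(probe0, input0) &
            Vector512.Equals(probe1, input1) &
            Vector512.Equals(probe2, input2);

        // Any non-zero lane means all three comparisons agreed in that lane.
        return combined != Vector512<ushort>.Zero;
    }

    public static void Main()
    {
        Vector512<ushort> s = Vector512.Create((ushort)'S');
        Vector512<ushort> h = Vector512.Create((ushort)'h');
        Vector512<ushort> e = Vector512.Create((ushort)'e');
        Vector512<ushort> a = Vector512.Create((ushort)'a');

        // All three inputs match their probes in every lane.
        Console.WriteLine(AnyLaneMatchesAll(s, h, e, s, h, e));
        // The third input never matches its probe, so no lane survives the AND.
        Console.WriteLine(AnyLaneMatchesAll(s, h, a, s, h, e));
    }
}
```

Whether the JIT keeps the combined value in a mask register for this exact shape depends on the runtime version and target hardware; inspecting the generated assembly is the only way to confirm it.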

@github-actions bot added the area-CodeGen-coreclr label (CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI) Jul 21, 2025
Contributor

Tagging subscribers to this area: @JulieLeeMSFT, @jakobbotsch
See info in area-owners.md if you want to be subscribed.

@MihaZupan
Member

@EgorBot -amd

using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;
using System;
using System.Buffers;
using System.Linq;

BenchmarkRunner.Run<SingleString>(args: args);

public class SingleString
{
    private static readonly SearchValues<string> s_values = SearchValues.Create([Needle], StringComparison.Ordinal);
    private static readonly SearchValues<string> s_valuesIC = SearchValues.Create([Needle], StringComparison.OrdinalIgnoreCase);
    private static readonly string s_text_noMatches = new('a', Length);
    private static readonly string s_text_falsePositives = string.Concat(Enumerable.Repeat("Sherlock Holm_s", Length / Needle.Length));

    public const int Length = 100_000;
    public const string Needle = "Sherlock Holmes";

    [Benchmark] public void Throughput() => s_text_noMatches.AsSpan().Contains(Needle, StringComparison.Ordinal);
    [Benchmark] public void SV_Throughput() => s_text_noMatches.AsSpan().ContainsAny(s_values);
    [Benchmark] public void SV_ThroughputIC() => s_text_noMatches.AsSpan().ContainsAny(s_valuesIC);

    [Benchmark] public void FalsePositives() => s_text_falsePositives.AsSpan().Contains(Needle, StringComparison.Ordinal);
    [Benchmark] public void SV_FalsePositives() => s_text_falsePositives.AsSpan().ContainsAny(s_values);
    [Benchmark] public void SV_FalsePositivesIC() => s_text_falsePositives.AsSpan().ContainsAny(s_valuesIC);
}

@MihaZupan
Member

This appears to be regressing that path currently: EgorBot/runtime-utils#445 (comment)

@tannergooding
Member Author

I'm not seeing the same locally on my Intel or AMD boxes.

What I am seeing is that these tests touch enough memory that they are significantly affected by data alignment and can vary by a decent amount across several separate runs. The work that BDN does to account for noise doesn't cover things like static readonly values, which are reused across iterations without being moved and so end up affecting the cache.

@tannergooding
Member Author

tannergooding commented Jul 21, 2025

For example, I get the following numbers on my 7900X:

Before

| Method | Mean | Error | StdDev | Median | Min | Max | Allocated |
|---|---:|---:|---:|---:|---:|---:|---:|
| Throughput | 3.174 us | 0.0816 us | 0.0940 us | 3.135 us | 3.063 us | 3.315 us | - |
| SV_Throughput | 3.967 us | 0.0220 us | 0.0172 us | 3.964 us | 3.947 us | 4.004 us | - |
| SV_ThroughputIC | 4.519 us | 0.0854 us | 0.0799 us | 4.549 us | 4.431 us | 4.614 us | - |
| FalsePositives | 11.598 us | 0.1648 us | 0.1542 us | 11.598 us | 11.383 us | 11.819 us | - |
| SV_FalsePositives | 9.393 us | 0.1852 us | 0.1819 us | 9.468 us | 9.058 us | 9.696 us | - |
| SV_FalsePositivesIC | 10.724 us | 0.1985 us | 0.1857 us | 10.705 us | 10.507 us | 10.984 us | - |

After

| Method | Mean | Error | StdDev | Median | Min | Max | Allocated |
|---|---:|---:|---:|---:|---:|---:|---:|
| Throughput | 3.042 us | 0.0342 us | 0.0303 us | 3.032 us | 3.005 us | 3.105 us | - |
| SV_Throughput | 3.956 us | 0.0058 us | 0.0045 us | 3.956 us | 3.947 us | 3.962 us | - |
| SV_ThroughputIC | 4.647 us | 0.0276 us | 0.0215 us | 4.651 us | 4.611 us | 4.682 us | - |
| FalsePositives | 11.619 us | 0.2319 us | 0.2169 us | 11.543 us | 11.327 us | 12.075 us | - |
| SV_FalsePositives | 9.513 us | 0.1498 us | 0.1328 us | 9.518 us | 9.356 us | 9.758 us | - |
| SV_FalsePositivesIC | 10.930 us | 0.0526 us | 0.0439 us | 10.928 us | 10.874 us | 11.026 us | - |

Different boxes report different numbers, but all are generally in line with the change being the same or faster, with smaller average codegen. The 7900X remains slightly pessimized for V512-related mask usage due to its double pumping, which makes the mask<->vector conversions a bit more visible.
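
One rough way to read the two 7900X tables against the noise concern is to compare each change in Mean with the reported Error values. The sketch below does that; it is an illustration only, not a method used in this PR. The names `NoiseCheckSketch` and `IsWithinNoise` are hypothetical, and the rule (a difference no larger than the sum of the two Error columns counts as noise) is an assumed heuristic based on BDN describing Error as half of the 99.9% confidence interval. It is not a proper statistical test.

```csharp
using System;

public static class NoiseCheckSketch
{
    // Hypothetical heuristic: treat two means as indistinguishable when their
    // difference is no larger than the sum of the two reported Error values.
    public static bool IsWithinNoise(double meanBefore, double errorBefore,
                                     double meanAfter, double errorAfter)
        => Math.Abs(meanAfter - meanBefore) <= errorBefore + errorAfter;

    public static void Main()
    {
        // Mean and Error in microseconds, copied from the Before/After tables.
        var rows = new (string Name, double MeanBefore, double ErrBefore, double MeanAfter, double ErrAfter)[]
        {
            ("Throughput",          3.174, 0.0816,  3.042, 0.0342),
            ("SV_Throughput",       3.967, 0.0220,  3.956, 0.0058),
            ("SV_ThroughputIC",     4.519, 0.0854,  4.647, 0.0276),
            ("FalsePositives",     11.598, 0.1648, 11.619, 0.2319),
            ("SV_FalsePositives",   9.393, 0.1852,  9.513, 0.1498),
            ("SV_FalsePositivesIC", 10.724, 0.1985, 10.930, 0.0526),
        };

        foreach (var r in rows)
        {
            double delta = r.MeanAfter - r.MeanBefore;
            bool noise = IsWithinNoise(r.MeanBefore, r.ErrBefore, r.MeanAfter, r.ErrAfter);
            string verdict = noise ? "within noise" : "outside noise";
            Console.WriteLine($"{r.Name,-20} {delta:+0.000;-0.000} us  {verdict}");
        }
    }
}
```

A single run per configuration cannot capture the run-to-run alignment effects described earlier, so a check like this only says something about noise within one run.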

@MihaZupan
Member

MihaZupan commented Jul 21, 2025

> can vary by a decent amount across several separate runs

Here are 7 consecutive runs from my 9950x consistently showing a similar regression as reported by the bot for the Epyc 9V74.
https://gist.github.com/MihaZupan/9cfd8e479bb95e8cd174e9704d6731ab

Or another run with LongJob (so multiple restarts) on the same Zen 4 Epyc: MihuBot/runtime-utils#1252

@tannergooding
Member Author

> Here are 7 consecutive runs from my 9950x consistently showing a similar regression as reported by the bot for the Epyc 9V74.

I'd speculate there's something else at play here in the microbenchmarks then.

I still cannot reproduce locally (Tiger Lake, Cascade Lake, Ice Lake, or Zen4) and this is a pure reduction in terms of uops, instructions, and code size. It follows the recommendations from the Intel/AMD optimization manuals and more closely matches what GCC/Clang/MSVC produce for similar code.

@tannergooding
Member Author

> more closely matches what GCC/Clang/MSVC produce for similar code.

What's still lacking is the handling of vpmovm2*; vptest and similar sequences, which is a more involved change, as indicated above.

@EgorBo
Member

EgorBo commented Jul 21, 2025

@EgorBot -aws_sapphirelake -azure_cascadelake

using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;
using System;
using System.Buffers;
using System.Linq;

BenchmarkRunner.Run<SingleString>(args: args);

public class SingleString
{
    private static readonly SearchValues<string> s_values = SearchValues.Create([Needle], StringComparison.Ordinal);
    private static readonly SearchValues<string> s_valuesIC = SearchValues.Create([Needle], StringComparison.OrdinalIgnoreCase);
    private static readonly string s_text_noMatches = new('a', Length);
    private static readonly string s_text_falsePositives = string.Concat(Enumerable.Repeat("Sherlock Holm_s", Length / Needle.Length));

    public const int Length = 100_000;
    public const string Needle = "Sherlock Holmes";

    [Benchmark] public void Throughput() => s_text_noMatches.AsSpan().Contains(Needle, StringComparison.Ordinal);
    [Benchmark] public void SV_Throughput() => s_text_noMatches.AsSpan().ContainsAny(s_values);
    [Benchmark] public void SV_ThroughputIC() => s_text_noMatches.AsSpan().ContainsAny(s_valuesIC);

    [Benchmark] public void FalsePositives() => s_text_falsePositives.AsSpan().Contains(Needle, StringComparison.Ordinal);
    [Benchmark] public void SV_FalsePositives() => s_text_falsePositives.AsSpan().ContainsAny(s_values);
    [Benchmark] public void SV_FalsePositivesIC() => s_text_falsePositives.AsSpan().ContainsAny(s_valuesIC);
}

@tannergooding
Member Author

tannergooding commented Jul 22, 2025

@EgorBo numbers for Sapphire Rapids/Cascade Lake look like what I saw, and what I see on my Zen4 and Tiger Lake boxes (minus the outlier you have for FalsePositives on Cascade Lake).

I think this is still something we should take. It is in line with the optimization this path was already intended to perform. It reduces instruction count, micro-op count, latency (as computed by LLVM MCA and uiCA), and assembly size, and it matches the general AVX512 optimization guidelines to avoid unnecessary conversions between mask and vector form.

Even if there is a regression on Zen5, this would be micro-architecture specific and something that should get addressed over time, particularly as we finish the other remaining optimization work to ensure we consume the kmask directly in the subsequent branch tests (i.e. we generate kortest directly, rather than vpmovm2*; vptest).

@tannergooding
Member Author

For reference, uiCA reports this for the prior code:

Throughput (in cycles per iteration): 4.62
Bottleneck: Scheduling

The following throughputs could be achieved if the given property were the only bottleneck:

  - Predecoder: 3.50
  - Decoder: 4.00
  - Issue: 2.40
  - Ports: 4.50

┌───────────────────────┬────────┬───────┬─────────────────────────────────────────────────────────────────────────────────────────┐
│ MITE   MS   DSB   LSD │ Issued │ Exec. │ Port 0   Port 1   Port 2   Port 3   Port 4   Port 5   Port 6   Port 7   Port 8   Port 9 │
├───────────────────────┼────────┼───────┼─────────────────────────────────────────────────────────────────────────────────────────┤
│  2                    │   2    │   2   │                    0.51     0.49               1                                        │ [vpcmpeqw k1, zmm6, zmmword ptr [rsi]](https://www.uops.info/html-instr/VPCMPEQW_K_ZMM_M512.html)
│  1                    │   1    │   1   │  0.98                                         0.02                                      │ [vpmovm2w zmm0, k1](https://www.uops.info/html-instr/VPMOVM2W_ZMM_K.html)
│  2                    │   2    │   2   │                    0.49     0.51               1                                        │ [vpcmpeqw k1, zmm7, zmmword ptr [rsi+r14*1]](https://www.uops.info/html-instr/VPCMPEQW_K_ZMM_M512.html)
│  1                    │   1    │   1   │  0.96                                         0.04                                      │ [vpmovm2w zmm1, k1](https://www.uops.info/html-instr/VPMOVM2W_ZMM_K.html)
│  2                    │   2    │   2   │                    0.51     0.49               1                                        │ [vpcmpeqw k1, zmm8, zmmword ptr [rsi+r15*1]](https://www.uops.info/html-instr/VPCMPEQW_K_ZMM_M512.html)
│  1                    │   1    │   1   │   1                                                                                     │ [vpmovm2w zmm2, k1](https://www.uops.info/html-instr/VPMOVM2W_ZMM_K.html)
│  1                    │   1    │   1   │  0.39                                         0.61                                      │ [vpternlogd zmm2, zmm1, zmm0, 0x80](https://www.uops.info/html-instr/VPTERNLOGD_ZMM_ZMM_ZMM_I8.html)
│  1                    │   1    │   1   │                                                1                                        │ [vptestmb k1, zmm2, zmm2](https://www.uops.info/html-instr/VPTESTMB_K_ZMM_ZMM.html)
│  1                    │   1    │   1   │   1                                                                                     │ [kortestq k1, k1](https://www.uops.info/html-instr/KORTESTQ_K_K.html)
├───────────────────────┼────────┼───────┼─────────────────────────────────────────────────────────────────────────────────────────┤
│  12                   │   12   │  12   │  4.33              1.51     1.49              4.67                                      │ Total
└───────────────────────┴────────┴───────┴─────────────────────────────────────────────────────────────────────────────────────────┘

While the new code is:

Throughput (in cycles per iteration): 4.30
Bottleneck: Scheduling

The following throughputs could be achieved if the given property were the only bottleneck:

  - Predecoder: 2.94
  - Decoder: 4.00
  - Issue: 2.20
  - Ports: 4.00

┌───────────────────────┬────────┬───────┬─────────────────────────────────────────────────────────────────────────────────────────┐
│ MITE   MS   DSB   LSD │ Issued │ Exec. │ Port 0   Port 1   Port 2   Port 3   Port 4   Port 5   Port 6   Port 7   Port 8   Port 9 │
├───────────────────────┼────────┼───────┼─────────────────────────────────────────────────────────────────────────────────────────┤
│  2                    │   2    │   2   │                    0.5      0.5                1                                        │ [vpcmpeqw k1, zmm6, zmmword ptr [rsi]](https://www.uops.info/html-instr/VPCMPEQW_K_ZMM_M512.html)
│  2                    │   2    │   2   │                    0.5      0.5                1                                        │ [vpcmpeqw k2, zmm7, zmmword ptr [rsi+r14*1]](https://www.uops.info/html-instr/VPCMPEQW_K_ZMM_M512.html)
│  1                    │   1    │   1   │   1                                                                                     │ [kandd k1, k1, k2](https://www.uops.info/html-instr/KANDD_K_K_K.html)
│  2                    │   2    │   2   │                    0.5      0.5                1                                        │ [vpcmpeqw k2, zmm8, zmmword ptr [rsi+r15*1]](https://www.uops.info/html-instr/VPCMPEQW_K_ZMM_M512.html)
│  1                    │   1    │   1   │   1                                                                                     │ [kandd k1, k1, k2](https://www.uops.info/html-instr/KANDD_K_K_K.html)
│  1                    │   1    │   1   │  0.7                                          0.3                                       │ [vpmovm2w zmm0, k1](https://www.uops.info/html-instr/VPMOVM2W_ZMM_K.html)
│  1                    │   1    │   1   │                                                1                                        │ [vptestmb k1, zmm0, zmm0](https://www.uops.info/html-instr/VPTESTMB_K_ZMM_ZMM.html)
│  1                    │   1    │   1   │   1                                                                                     │ [kortestq k1, k1](https://www.uops.info/html-instr/KORTESTQ_K_K.html)
├───────────────────────┼────────┼───────┼─────────────────────────────────────────────────────────────────────────────────────────┤
│  11                   │   11   │  11   │  3.7               1.5      1.5               4.3                                       │ Total
└───────────────────────┴────────┴───────┴─────────────────────────────────────────────────────────────────────────────────────────┘

If we also resolve the vpmovm2, vptestm, kortest sequence to just kortest, we would get:

Throughput (in cycles per iteration): 3.00
Bottlenecks: Decoder, Ports

The following throughputs could be achieved if the given property were the only bottleneck:

  - Predecoder: 2.19
  - Decoder: 3.00
  - Issue: 1.80
  - Ports: 3.00

┌───────────────────────┬────────┬───────┬─────────────────────────────────────────────────────────────────────────────────────────┐
│ MITE   MS   DSB   LSD │ Issued │ Exec. │ Port 0   Port 1   Port 2   Port 3   Port 4   Port 5   Port 6   Port 7   Port 8   Port 9 │
├───────────────────────┼────────┼───────┼─────────────────────────────────────────────────────────────────────────────────────────┤
│  2                    │   2    │   2   │                    0.5      0.5                1                                        │ [vpcmpeqw k1, zmm6, zmmword ptr [rsi]](https://www.uops.info/html-instr/VPCMPEQW_K_ZMM_M512.html)
│  2                    │   2    │   2   │                    0.5      0.5                1                                        │ [vpcmpeqw k2, zmm7, zmmword ptr [rsi+r14*1]](https://www.uops.info/html-instr/VPCMPEQW_K_ZMM_M512.html)
│  1                    │   1    │   1   │   1                                                                                     │ [kandd k1, k1, k2](https://www.uops.info/html-instr/KANDD_K_K_K.html)
│  2                    │   2    │   2   │                    0.5      0.5                1                                        │ [vpcmpeqw k2, zmm8, zmmword ptr [rsi+r15*1]](https://www.uops.info/html-instr/VPCMPEQW_K_ZMM_M512.html)
│  1                    │   1    │   1   │   1                                                                                     │ [kandd k1, k1, k2](https://www.uops.info/html-instr/KANDD_K_K_K.html)
│  1                    │   1    │   1   │   1                                                                                     │ [kortestq k1, k1](https://www.uops.info/html-instr/KORTESTQ_K_K.html)
├───────────────────────┼────────┼───────┼─────────────────────────────────────────────────────────────────────────────────────────┤
│  9                    │   9    │   9   │   3                1.5      1.5                3                                        │ Total
└───────────────────────┴────────┴───────┴─────────────────────────────────────────────────────────────────────────────────────────┘
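
The three uiCA reports above can be put side by side with a little arithmetic. The sketch below is illustrative only: the names `UicaSummarySketch` and `PercentReduction` are hypothetical, the cycle and uop figures are copied from the "Throughput" lines and the "Issued" totals of the reports, and the labels for the three variants are informal shorthand. The figures are uiCA estimates for the modelled core, not measurements.

```csharp
using System;

public static class UicaSummarySketch
{
    // Percentage reduction of `value` relative to `baseline`.
    public static double PercentReduction(double baseline, double value)
        => (baseline - value) / baseline * 100.0;

    public static void Main()
    {
        // (label, cycles per iteration, issued uops) from the three reports.
        var reports = new (string Label, double Cycles, int Uops)[]
        {
            ("prior code",   4.62, 12),
            ("this PR",      4.30, 11),
            ("kortest only", 3.00,  9),
        };

        var baseline = reports[0];
        foreach (var r in reports)
        {
            double saved = PercentReduction(baseline.Cycles, r.Cycles);
            Console.WriteLine(
                $"{r.Label,-14} {r.Cycles:0.00} cycles/iter, {r.Uops} uops, {saved:0.0}% fewer cycles than the prior code");
        }
    }
}
```

By this estimate most of the remaining gain comes from removing the vpmovm2, vptestm, kortest sequence, which matches the follow-up work described earlier in the conversation.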

@tannergooding merged commit 7132299 into dotnet:main Jul 22, 2025
108 of 110 checks passed
@tannergooding deleted the small-mask-fold branch July 22, 2025 18:53
@github-actions bot locked and limited conversation to collaborators Aug 22, 2025