Conversation

@CISC (Collaborator) commented Sep 28, 2025

@github-actions bot added the ggml label (changes relating to the ggml tensor library for machine learning) on Sep 28, 2025
@ggerganov (Member) commented Sep 28, 2025

Don't think this is correct. The argument order for the GGML_F32_VEC_FMA macro is always the same, but it can differ on some instruction sets. Which one are you using?

I think this one is wrong:

#if defined(__FMA__)
// TODO: Does this work?
#define GGML_F32x4_FMA(a, b, c) _mm_fmadd_ps(b, c, a)
#else

https://www.intel.com/content/www/us/en/docs/intrinsics-guide/index.html#text=_mm_fmadd_ps&ig_expand=3103
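
For reference, a standalone snippet (illustrative only, not ggml code; the values are made up) showing the documented operand order of the intrinsic, _mm_fmadd_ps(a, b, c) = a*b + c:

#include <immintrin.h>
#include <stdio.h>

// Illustration only: Intel documents _mm_fmadd_ps(a, b, c) as a*b + c.
// Build with FMA enabled, e.g. gcc -mfma -O2.
int main(void) {
    __m128 a = _mm_set1_ps(2.0f);
    __m128 b = _mm_set1_ps(3.0f);
    __m128 c = _mm_set1_ps(4.0f);
    __m128 r = _mm_fmadd_ps(a, b, c);   // 2*3 + 4
    float out[4];
    _mm_storeu_ps(out, r);
    printf("%f\n", out[0]);             // prints 10.000000
    return 0;
}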

@CISC (Collaborator, Author) commented Sep 28, 2025

> Don't think this is correct. The argument order for the GGML_F32_VEC_FMA macro is always the same, but it can differ on some instruction sets. Which one are you using?

This one, but if I change that, everything else breaks:

#if defined(__FMA__)
#define GGML_F32x8_FMA(a, b, c) _mm256_fmadd_ps(b, c, a)
#else
#define GGML_F32x8_FMA(a, b, c) _mm256_add_ps(_mm256_mul_ps(b, c), a)
#endif
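
Note that both branches above compute b*c + a; a standalone check (illustrative only, not ggml code; values arbitrary) that the two paths agree:

#include <immintrin.h>
#include <stdio.h>

// Both branches of GGML_F32x8_FMA compute b*c + a:
// _mm256_fmadd_ps(b, c, a) = b*c + a, same as _mm256_add_ps(_mm256_mul_ps(b, c), a).
// Build with e.g. gcc -mavx -mfma -O2.
int main(void) {
    __m256 a = _mm256_set1_ps(1.0f);
    __m256 b = _mm256_set1_ps(2.0f);
    __m256 c = _mm256_set1_ps(3.0f);

    __m256 fma_path = _mm256_fmadd_ps(b, c, a);                // FMA branch
    __m256 fallback = _mm256_add_ps(_mm256_mul_ps(b, c), a);   // non-FMA branch

    float x[8], y[8];
    _mm256_storeu_ps(x, fma_path);
    _mm256_storeu_ps(y, fallback);
    printf("%f %f\n", x[0], y[0]);      // both print 7.000000
    return 0;
}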

@CISC (Collaborator, Author) commented Sep 28, 2025

This is confusing; in some places it's used like this:

ay2 = GGML_F32_VEC_FMA(ay2, ax2, vx);

and in other places like this:

sum[k][j] = GGML_F16_VEC_FMA(sum[k][j], ax[j], ay[j]);

@CISC (Collaborator, Author) commented Sep 28, 2025

According to Intel it should be a * b + c, so something is screwy:
https://www.intel.com/content/www/us/en/docs/intrinsics-guide/index.html#text=_mm256_fmadd_ps&ig_expand=3107

@ggerganov (Member) commented Sep 28, 2025

It appears that the meaning of GGML_F32_VEC_FMA(a, b, c) and GGML_F16_VEC_FMA(a, b, c) is intended to be:

b*c + a

So the change that you propose makes sense now, and this is in line with the Intel docs.
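
A scalar restatement of that convention (illustrative only, not ggml code), showing how the two call sites quoted earlier read under it:

#include <stdio.h>

// Assumed convention from this discussion: GGML_F32_VEC_FMA(a, b, c) == b*c + a,
// i.e. the first argument is the accumulator.
static float vec_fma_ref(float a, float b, float c) {
    return b*c + a;
}

int main(void) {
    // ay2 = GGML_F32_VEC_FMA(ay2, ax2, vx)  ->  ay2 += ax2*vx
    float ay2 = 1.0f, ax2 = 2.0f, vx = 3.0f;
    ay2 = vec_fma_ref(ay2, ax2, vx);
    printf("%f\n", ay2);    // 7.000000

    // sum[k][j] = GGML_F16_VEC_FMA(sum[k][j], ax[j], ay[j])  ->  sum[k][j] += ax[j]*ay[j]
    float sum = 1.0f, ax = 2.0f, ay = 3.0f;
    sum = vec_fma_ref(sum, ax, ay);
    printf("%f\n", sum);    // 7.000000
    return 0;
}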

@CISC (Collaborator, Author) commented Sep 28, 2025

> So the change that you propose makes sense now, and this is in line with the Intel docs.

Yes, I think so too; it's just somewhat confusing with the ordering and naming between the vec_mad1 and vec_mad functions...
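
For context on that naming, a rough sketch of the two helpers as I read them (signatures paraphrased and treated as assumptions, not verbatim ggml code):

// Paraphrased sketch, not copied from ggml: ggml_vec_mad_f32 accumulates into y,
// while ggml_vec_mad1_f32 (the function touched by this PR) is a scale-plus-bias
// without an accumulator, which is why the FMA argument order is easy to get wrong.
static void vec_mad_f32_ref(int n, float * y, const float * x, float v) {
    for (int i = 0; i < n; ++i) {
        y[i] += x[i]*v;     // accumulator form: matches FMA(acc, x, v) = x*v + acc
    }
}

static void vec_mad1_f32_ref(int n, float * y, const float * x, float s, float b) {
    for (int i = 0; i < n; ++i) {
        y[i] = x[i]*s + b;  // scale-plus-bias form: the added term is a constant
    }
}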

@CISC (Collaborator, Author) commented Sep 28, 2025

It seems this codepath is not taken when built for CUDA, as the backend-ops test succeeds...

Edit: I guess this means we never really test the hardware-optimized CPU backend?

@ggerganov (Member) commented Sep 28, 2025

> It seems this codepath is not taken when built for CUDA, as the backend-ops test succeeds...
>
> Edit: I guess this means we never really test the hardware-optimized CPU backend?

I think the reason we don't see it is that the tests currently have ne0 == 10, which is too small to enter the SIMD path. If you add a larger test (for example, ne0 == 100) it should fail.

test_cases.emplace_back(new test_scale());
test_cases.emplace_back(new test_scale(GGML_TYPE_F32, {10, 10, 10, 10}, 2.0f, 1.0f));
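
A hedged sketch of what the extra case might look like (the exact shape added in the PR may differ; only ne0 needs to be large enough to reach the SIMD path):

// hypothetical addition alongside the cases above; ne0 = 100 so the SIMD path is exercised
test_cases.emplace_back(new test_scale(GGML_TYPE_F32, {100, 10, 10, 10}, 2.0f, 1.0f));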

@CISC (Collaborator, Author) commented Sep 28, 2025

> It seems this codepath is not taken when built for CUDA, as the backend-ops test succeeds...
> Edit: I guess this means we never really test the hardware-optimized CPU backend?

> I think the reason we don't see it is that the tests currently have ne0 == 10, which is too small to enter the SIMD path. If you add a larger test (for example, ne0 == 100) it should fail.

Yep, it does, I'll add the test.

@github-actions bot added the testing label (Everything test related) on Sep 28, 2025
@CISC merged commit b887d2f into master on Sep 28, 2025 (63 of 67 checks passed).
@CISC deleted the cisc/fix-ggml-vec-mad1-f32-fma branch on September 28, 2025 at 21:15.