model : add GroveMoE support #15510
Conversation
Looks like ccache breaks the build (using cached files newer than this branch), not important right now though...
@slaren gentle ping
common/arg.cpp
Outdated
"keep all Mixture of Experts (MoE) weights in the CPU", | ||
[](common_params & params) { | ||
params.tensor_buft_overrides.push_back({"\\.ffn_(up|down|gate)_exps", ggml_backend_cpu_buffer_type()}); | ||
params.tensor_buft_overrides.push_back({"\\.ffn_(up|down|gate)_(ch|)exps", ggml_backend_cpu_buffer_type()}); |
The regexes are now in common.h
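For a quick sanity check (a sketch, not part of the PR), the widened pattern can be verified against representative GGUF tensor names; the names below are illustrative:

#include <cassert>
#include <regex>
#include <string>

// Sketch: confirm the widened override pattern matches both ordinary expert
// tensors and GroveMoE's chunk/adjugate expert tensors (tensor names illustrative).
int main() {
    const std::regex pattern("\\.ffn_(up|down|gate)_(ch|)exps");
    assert( std::regex_search(std::string("blk.0.ffn_up_exps.weight"),     pattern));
    assert( std::regex_search(std::string("blk.0.ffn_gate_chexps.weight"), pattern));
    assert(!std::regex_search(std::string("blk.0.ffn_up.weight"),          pattern));
    return 0;
}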
ggml_tensor * weights = ggml_get_rows(ctx0,
        ggml_reshape_3d(ctx0, probs, 1, n_expert, n_tokens), selected_experts); // [1, n_expert_used, n_tokens]
if (arch == LLM_ARCH_GROVEMOE && n_expert != hparams.n_expert) {
When is n_expert != hparams.n_expert?
When doing the adjugate experts pass, where build_moe_ffn is called with n_chunk_expert rather than hparams.n_expert:
Lines 19025 to 19038 in ee51669
// TODO: Only do the expert selection and weights once
moe_out =
        build_moe_ffn(cur,
                nullptr,
                model.layers[il].ffn_up_chexps,
                model.layers[il].ffn_gate_chexps,
                model.layers[il].ffn_down_chexps,
                nullptr,
                n_chunk_expert, n_expert_used > n_chunk_expert ? n_chunk_expert : n_expert_used,
                LLM_FFN_SILU, true,
                false, 0.0,
                LLAMA_EXPERT_GATING_FUNC_TYPE_SOFTMAX,
                il, probs);
cb(moe_out, "ffn_adj_moe_out", il);
Oh, I just noticed it breaks (just outputs endless
Are you running F16 weights? If yes, there is a chance you are hitting this assert: llama.cpp/ggml/src/ggml-cpu/vec.cpp, lines 327 to 328 in 152729f.
Build in Debug to confirm that.
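For illustration only (not the verbatim lines linked above), the referenced guard is a debug-time NaN/Inf assert of roughly this shape; it is compiled out in Release builds, which is why a Debug build is needed to trigger it:

// Illustrative sketch of a debug-only guard in ggml's CPU vector code;
// not the exact contents of vec.cpp lines 327-328.
#ifndef NDEBUG
    assert(!isnan(value));  // 'value' is a stand-in for the quantity being checked
    assert(!isinf(value));
#endif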
Nope, Q8_0.
I will do that and try to figure out the issue later.
Didn't catch anything; however, when I run it through
Ok, full offloading works fine too, so this is unlikely to be a model issue; it just seems to be triggering some problem with partial offloading of experts. Merging.
Btw, which was the first op that produced the NaN when you ran the
Edit:
Ok, I'll likely take a look when the GGUFs appear and if I don't forget.
* add GroveMoE support
* remove constexpr that fails on certain compilers
* revert crude scalar div implementation, use cast
* build_attn_inp_kv_unified -> build_attn_inp_kv
* fix build_attn
* re-apply ffn_exps regex changes
I'm attempting to make an imatrix and I'm getting this error:
Interesting, is it fully offloaded?
I'm also receiving a similar error:
This is running Gabriel Larson's F16 GGUF, fully offloaded, using a ROCm rc7-rocwmma docker/toolbox (I'm using a Strix Halo APU), running llama.cpp. Unfortunately it appears the core dump is getting snatched up by Fedora (the joys of Bazzite), but if it's useful, I could try to finagle my settings to get the file. If I use this Q8 GGUF instead, then it loads and runs without issue.
Even more interesting, so it seems to be an issue with that GGUF. @gabriellarson Can you try creating a
Adds support for inclusionAI/GroveMoE, a novel architecture that groups adjugate experts with ordinary experts (paper).
The PR is in a fully working state, but I'm submitting it as a draft because it requires a scalar div implementation that was quickly hacked together just to get the model running. Only div is (very crudely) implemented, and only for CPU (which doesn't matter much, since little computation is spent here), and I'm not satisfied that the API makes sense; in short, this requires more thought!
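To make the later commit note "revert crude scalar div implementation, use cast" concrete, here is a rough sketch of the underlying idea, under the assumption that the mapping from routed expert index to adjugate/chunk expert is an integer division by the group size (n_expert / n_chunk_expert), and that the backend supports F32 <-> I32 copies; this is not necessarily the exact code that was merged:

// Sketch only: map routed expert indices to chunk/adjugate expert indices
// without a dedicated scalar integer-div op.
// Assumes hparams.n_expert is a multiple of n_chunk_expert and that the
// backend supports F32 <-> I32 casts.
const int group_size = hparams.n_expert / n_chunk_expert;

ggml_tensor * idx_f32   = ggml_cast(ctx0, selected_experts, GGML_TYPE_F32); // I32 -> F32
idx_f32                 = ggml_scale(ctx0, idx_f32, 1.0f / (float) group_size);
ggml_tensor * chunk_idx = ggml_cast(ctx0, idx_f32, GGML_TYPE_I32);          // truncates back to I32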