Conversation

@Gadflyii commented Sep 28, 2025

This change adds a new toggle, "--amx", that allows the extra CPU buffer types (bufts) to remain active when a GPU is present, enabling AMX operations in a CPU/GPU hybrid setup. Without "--amx", current behavior is unchanged. An example invocation is shown after the list below.

  • The toggle is functional in llama-bench, llama-cli, and llama-server.
  • Compatible with --cpu-moe, --n-cpu-moe N, --cpu-moe-draft, and --n-cpu-moe-draft N as implemented on Sep 27, 2025.
  • Compatible with Sapphire Rapids, Emerald Rapids, and Granite Rapids CPUs.
  • If "--amx" is accidentally enabled on non-Intel CPUs, or on Intel CPUs without AMX, there is no change in behavior (tested with an AMD 9950X3D and an Intel 14900K).
  • Works in WSL or native Linux (tested on Ubuntu 24.04 LTS and on Windows 11 + Ubuntu WSL).
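
Example invocation (a sketch only; the model path and GPU layer split are placeholders):

llama-cli -m /path/to/model.gguf --n-gpu-layers 10 --cpu-moe --amx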

This change allows significant performance increases on the CPU-offloaded layers / MoE experts during hybrid operation, especially in prompt eval, where 100%-150%+ performance uplifts are common:

Examples:

Base command:

numactl -N 2,3 -m 2,3 ~/src/llama.cpp/build/bin/llama-cli -m /mnt/ssd2/AI/Qwen3_30B/Q4_0/Qwen3-30B-A3B-Thinking-2507-Q4_0.gguf  -t 64 -b 1024 -c 1024 -n 1024 --numa numactl -p "The quick brown fox jumps over the lazy dog many times. A curious cat watches carefully from the garden wall nearby. Birds sing softly in the morning air, while the sun rises gently above the hills. Children walk slowly to school carrying bright backpacks filled with books, pencils, and small notes. The teacher greets them warmly at the classroom door. Lessons begin with stories about science, history, art, and music. Ideas flow clearly and simply, creating a calm rhythm of learning. Friends share smiles, trade sandwiches, and laugh during the short break. The day continues peacefully until the afternoon bell finally rings." -no-cnv --n-gpu-layers 10

No AMX (Current behavior):

llama_perf_sampler_print:    sampling time =      91.57 ms /   927 runs   (    0.10 ms per token, 10123.18 tokens per second)
llama_perf_context_print:        load time =    1202.46 ms
llama_perf_context_print: prompt eval time =    1020.54 ms /   122 tokens (    8.37 ms per token,   119.54 tokens per second)
llama_perf_context_print:        eval time =   22999.39 ms /   804 runs   (   28.61 ms per token,    34.96 tokens per second)
llama_perf_context_print:       total time =   24432.43 ms /   926 tokens
llama_perf_context_print:    graphs reused =        800
llama_memory_breakdown_print: | memory breakdown [MiB] | total    free     self   model   context   compute    unaccounted |
llama_memory_breakdown_print: |   - CUDA0 (RTX 5090)   | 32086 = 26593 + ( 3915 =  3351 +      20 +     544) +        1578 |
llama_memory_breakdown_print: |   - Host               |                  13299 = 13217 +      76 +       6                |

W/ "--amx":

llama_perf_sampler_print:    sampling time =     100.54 ms /  1024 runs   (    0.10 ms per token, 10185.00 tokens per second)
llama_perf_context_print:        load time =    9185.60 ms
llama_perf_context_print: prompt eval time =     478.09 ms /   122 tokens (    3.92 ms per token,   255.18 tokens per second)
llama_perf_context_print:        eval time =   22453.23 ms /   901 runs   (   24.92 ms per token,    40.13 tokens per second)
llama_perf_context_print:       total time =   23289.67 ms /  1023 tokens
llama_perf_context_print:    graphs reused =        897
llama_memory_breakdown_print: | memory breakdown [MiB] | total    free     self   model   context   compute    unaccounted |
llama_memory_breakdown_print: |   - CUDA0 (RTX 5090)   | 32086 = 26885 + ( 3571 =  3351 +      20 +     200) +        1629 |
llama_memory_breakdown_print: |   - Host               |                  13243 = 12866 +      76 +     300                |
llama_memory_breakdown_print: |   - CPU_REPACK         |                  11664 = 11664 +       0 +       0                |
llama_memory_breakdown_print: |   - AMX                |                    628 =   628 +       0 +       0                |

Results:

| Metric | No AMX (tps) | With --amx (tps) | Δ tps | Δ % |
| --- | --- | --- | --- | --- |
| Prompt Evaluation | 119.54 | 255.18 | +135.64 | +113.47% |
| Token Evaluation | 34.96 | 40.13 | +5.17 | +14.79% |
| Overall Inference | 37.90 | 43.93 | +6.02 | +15.90% |
| Sampling | 10123.18 | 10185.00 | +61.82 | +0.61% |

With "--cpu-moe":

No AMX (Current behavior):

llama_perf_sampler_print:    sampling time =     102.79 ms /  1024 runs   (    0.10 ms per token,  9961.96 tokens per second)
llama_perf_context_print:        load time =     615.24 ms
llama_perf_context_print: prompt eval time =    1198.63 ms /   122 tokens (    9.82 ms per token,   101.78 tokens per second)
llama_perf_context_print:        eval time =   27418.81 ms /   901 runs   (   30.43 ms per token,    32.86 tokens per second)
llama_perf_context_print:       total time =   29076.75 ms /  1023 tokens
llama_perf_context_print:    graphs reused =        897
llama_memory_breakdown_print: | memory breakdown [MiB] | total    free     self   model   context   compute    unaccounted |
llama_memory_breakdown_print: |   - CUDA0 (RTX 5090)   | 32086 = 29776 + (  675 =   111 +      20 +     544) +        1634 |
llama_memory_breakdown_print: |   - Host               |                  16651 = 16569 +      76 +       6                |

W/ "--amx":

llama_perf_sampler_print:    sampling time =     100.07 ms /  1024 runs   (    0.10 ms per token, 10232.84 tokens per second)
llama_perf_context_print:        load time =   10873.17 ms
llama_perf_context_print: prompt eval time =     530.19 ms /   122 tokens (    4.35 ms per token,   230.11 tokens per second)
llama_perf_context_print:        eval time =   23928.42 ms /   901 runs   (   26.56 ms per token,    37.65 tokens per second)
llama_perf_context_print:       total time =   24809.11 ms /  1023 tokens
llama_perf_context_print:    graphs reused =        897
llama_memory_breakdown_print: | memory breakdown [MiB] | total    free     self   model   context   compute    unaccounted |
llama_memory_breakdown_print: |   - CUDA0 (RTX 5090)   | 32086 = 30115 + (  331 =   111 +      20 +     200) +        1640 |
llama_memory_breakdown_print: |   - Host               |                  13243 = 12866 +      76 +     300                |
llama_memory_breakdown_print: |   - CPU_REPACK         |                  14904 = 14904 +       0 +       0                |
llama_memory_breakdown_print: |   - AMX                |                    628 =   628 +       0 +       0                |

Results:

| Metric | No AMX (tps) | With --amx (tps) | Δ tps | Δ % |
| --- | --- | --- | --- | --- |
| Prompt Evaluation | 101.78 | 230.11 | +128.33 | +126.06% |
| Token Evaluation | 32.86 | 37.65 | +4.79 | +14.58% |
| Overall Inference | 35.18 | 41.24 | +6.06 | +17.23% |
| Sampling | 9961.96 | 10232.84 | +270.88 | +2.72% |

@Gadflyii (Author)

Let me know if you have any questions.

@slaren (Member) commented Sep 28, 2025

This should already be possible with the more generic command line option -nr, --no-repack.

Nvm that, that option does the opposite. I think the better solution would be to add an option to disable host buffer types in make_cpu_buft_list.

@Gadflyii (Author)

> This should already be possible with the more generic command line option -nr, --no-repack.
>
> Nvm that, that option does the opposite. I think the better solution would be to add an option to disable host buffer types in make_cpu_buft_list.

I have played with that a little, but found I couldn't get it to work as expected. I think that is due to how the extra bufts are implemented in the original AMX PR. Not all of the CPU weights go into the CPU_REPACK / AMX bufts, so I think we need to keep the CPU_Mapped model buffer in addition to the extra bufts CPU_REPACK and AMX?

Is that what you meant?

@slaren (Member) commented Sep 28, 2025

What I mean is adding an option to skip adding the host buffer types here:

// add a host buffer type
// storing the tensors in a host buffer is useful when the processing of large batches
// is offloaded to a GPU device, since it reduces the time spent on data transfers
// generally, this will be done using the first device in the list
// a better approach would be to handle this on a weight-by-weight basis using the offload_op
// function of the device to determine if it would benefit from being stored in a host buffer
for (auto * dev : devices) {
    ggml_backend_buffer_type_t buft = ggml_backend_dev_host_buffer_type(dev);
    if (buft) {
        buft_list.emplace_back(dev, buft);
        break;
    }
}

The reason the extra buffer types don't get used when there is a GPU, is because the host buffer types have higher priority. Alternatively, the option could give repack buffers higher priority, but still keep the host buffer types.
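
A minimal sketch of what that could look like, assuming a hypothetical use_host_buffer flag (the name is illustrative, not the actual patch):

// Illustrative sketch only: wrap the host-buffer step above in a flag. When the
// hypothetical use_host_buffer is false, no host buffer type is added to buft_list,
// so the CPU extra buffer types (AMX / repack) keep their higher priority and are
// selected for the CPU-resident weights; weights they do not support still fall
// through to the plain CPU buffer type at the end of the list.
if (use_host_buffer) {
    for (auto * dev : devices) {
        ggml_backend_buffer_type_t buft = ggml_backend_dev_host_buffer_type(dev);
        if (buft) {
            buft_list.emplace_back(dev, buft);
            break;
        }
    }
}

The alternative mentioned above would instead keep this block but insert the repack/AMX entries ahead of it in buft_list.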

@Gadflyii (Author)

> What I mean is adding an option to skip adding the host buffer types here: […]
>
> The reason the extra buffer types don't get used when there is a GPU, is because the host buffer types have higher priority. Alternatively, the option could give repack buffers higher priority, but still keep the host buffer types.

I will make the change and update the PR

@Gadflyii (Author)

@slaren any feedback on what the "opt-in" switch should be called? I can keep it "--amx", or make it more generic, something like "--xbuffers", in case other extra buffers are added in the future.

@slaren (Member) commented Sep 29, 2025

I am not sure what the opt-in switch would be. What I am proposing is a flag to disable host buffer types, and it should be called something like --no-host.
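
If the flag lands under that name, usage would just be the existing command line with the flag appended (a sketch; model path and layer split are placeholders):

llama-cli -m /path/to/model.gguf --n-gpu-layers 10 --cpu-moe --no-host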

@Gadflyii (Author)

@slaren all changes have been made.

All feedback is welcome, and thank you for all your help.
