Conversation

@Gadflyii commented Sep 28, 2025

This change adds a new toggle, "--amx", that allows the extra CPU buffer types (bufts) to remain active when a GPU is present, enabling AMX operations in a CPU/GPU hybrid setup. Without "--amx", current behavior is unchanged. An example invocation is shown after the list below.

  • The toggle is functional in llama-bench, llama-cli, and llama-server.
  • Compatible with --cpu-moe, --n-cpu-moe N, --cpu-moe-draft, and --n-cpu-moe-draft N as implemented on Sep 27, 2025.
  • Compatible with Sapphire Rapids, Emerald Rapids, and Granite Rapids CPUs.
  • If "--amx" is accidentally enabled on non-Intel CPUs, or on Intel CPUs without AMX, there is no change in behavior (tested with an AMD 9950X3D and an Intel 14900K).
  • Works in WSL or native Linux (tested on Ubuntu 24.04 LTS and on Windows 11 + Ubuntu WSL).
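
Example invocation (a sketch only; the model path and GPU layer split are placeholders):

llama-cli -m /path/to/model.gguf --n-gpu-layers 10 --cpu-moe --amx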

This change allows significant performance increases on the CPU-offloaded layers / MoE experts during hybrid operation, especially in prompt eval, where 100%-150%+ performance uplifts are common:

Examples:

Base command:

numactl -N 2,3 -m 2,3 ~/src/llama.cpp/build/bin/llama-cli -m /mnt/ssd2/AI/Qwen3_30B/Q4_0/Qwen3-30B-A3B-Thinking-2507-Q4_0.gguf  -t 64 -b 1024 -c 1024 -n 1024 --numa numactl -p "The quick brown fox jumps over the lazy dog many times. A curious cat watches carefully from the garden wall nearby. Birds sing softly in the morning air, while the sun rises gently above the hills. Children walk slowly to school carrying bright backpacks filled with books, pencils, and small notes. The teacher greets them warmly at the classroom door. Lessons begin with stories about science, history, art, and music. Ideas flow clearly and simply, creating a calm rhythm of learning. Friends share smiles, trade sandwiches, and laugh during the short break. The day continues peacefully until the afternoon bell finally rings." -no-cnv --n-gpu-layers 10

No AMX (Current behavior):

llama_perf_sampler_print:    sampling time =      91.57 ms /   927 runs   (    0.10 ms per token, 10123.18 tokens per second)
llama_perf_context_print:        load time =    1202.46 ms
llama_perf_context_print: prompt eval time =    1020.54 ms /   122 tokens (    8.37 ms per token,   119.54 tokens per second)
llama_perf_context_print:        eval time =   22999.39 ms /   804 runs   (   28.61 ms per token,    34.96 tokens per second)
llama_perf_context_print:       total time =   24432.43 ms /   926 tokens
llama_perf_context_print:    graphs reused =        800
llama_memory_breakdown_print: | memory breakdown [MiB] | total    free     self   model   context   compute    unaccounted |
llama_memory_breakdown_print: |   - CUDA0 (RTX 5090)   | 32086 = 26593 + ( 3915 =  3351 +      20 +     544) +        1578 |
llama_memory_breakdown_print: |   - Host               |                  13299 = 13217 +      76 +       6                |

W/ "--amx":

llama_perf_sampler_print:    sampling time =     100.54 ms /  1024 runs   (    0.10 ms per token, 10185.00 tokens per second)
llama_perf_context_print:        load time =    9185.60 ms
llama_perf_context_print: prompt eval time =     478.09 ms /   122 tokens (    3.92 ms per token,   255.18 tokens per second)
llama_perf_context_print:        eval time =   22453.23 ms /   901 runs   (   24.92 ms per token,    40.13 tokens per second)
llama_perf_context_print:       total time =   23289.67 ms /  1023 tokens
llama_perf_context_print:    graphs reused =        897
llama_memory_breakdown_print: | memory breakdown [MiB] | total    free     self   model   context   compute    unaccounted |
llama_memory_breakdown_print: |   - CUDA0 (RTX 5090)   | 32086 = 26885 + ( 3571 =  3351 +      20 +     200) +        1629 |
llama_memory_breakdown_print: |   - Host               |                  13243 = 12866 +      76 +     300                |
llama_memory_breakdown_print: |   - CPU_REPACK         |                  11664 = 11664 +       0 +       0                |
llama_memory_breakdown_print: |   - AMX                |                    628 =   628 +       0 +       0                |

Results:

| Metric | No AMX (tps) | With --amx (tps) | Δ tps | Δ % |
| --- | --- | --- | --- | --- |
| Prompt Evaluation | 119.54 | 255.18 | +135.64 | +113.47% |
| Token Evaluation | 34.96 | 40.13 | +5.17 | +14.79% |
| Overall Inference | 37.90 | 43.93 | +6.02 | +15.90% |
| Sampling | 10123.18 | 10185.00 | +61.82 | +0.61% |

With "--cpu-moe":

No AMX (Current behavior):

llama_perf_sampler_print:    sampling time =     102.79 ms /  1024 runs   (    0.10 ms per token,  9961.96 tokens per second)
llama_perf_context_print:        load time =     615.24 ms
llama_perf_context_print: prompt eval time =    1198.63 ms /   122 tokens (    9.82 ms per token,   101.78 tokens per second)
llama_perf_context_print:        eval time =   27418.81 ms /   901 runs   (   30.43 ms per token,    32.86 tokens per second)
llama_perf_context_print:       total time =   29076.75 ms /  1023 tokens
llama_perf_context_print:    graphs reused =        897
llama_memory_breakdown_print: | memory breakdown [MiB] | total    free     self   model   context   compute    unaccounted |
llama_memory_breakdown_print: |   - CUDA0 (RTX 5090)   | 32086 = 29776 + (  675 =   111 +      20 +     544) +        1634 |
llama_memory_breakdown_print: |   - Host               |                  16651 = 16569 +      76 +       6                |

W/ "--amx":

llama_perf_sampler_print:    sampling time =     100.07 ms /  1024 runs   (    0.10 ms per token, 10232.84 tokens per second)
llama_perf_context_print:        load time =   10873.17 ms
llama_perf_context_print: prompt eval time =     530.19 ms /   122 tokens (    4.35 ms per token,   230.11 tokens per second)
llama_perf_context_print:        eval time =   23928.42 ms /   901 runs   (   26.56 ms per token,    37.65 tokens per second)
llama_perf_context_print:       total time =   24809.11 ms /  1023 tokens
llama_perf_context_print:    graphs reused =        897
llama_memory_breakdown_print: | memory breakdown [MiB] | total    free     self   model   context   compute    unaccounted |
llama_memory_breakdown_print: |   - CUDA0 (RTX 5090)   | 32086 = 30115 + (  331 =   111 +      20 +     200) +        1640 |
llama_memory_breakdown_print: |   - Host               |                  13243 = 12866 +      76 +     300                |
llama_memory_breakdown_print: |   - CPU_REPACK         |                  14904 = 14904 +       0 +       0                |
llama_memory_breakdown_print: |   - AMX                |                    628 =   628 +       0 +       0                |

Results:

| Metric | No AMX (tps) | With --amx (tps) | Δ tps | Δ % |
| --- | --- | --- | --- | --- |
| Prompt Evaluation | 101.78 | 230.11 | +128.33 | +126.06% |
| Token Evaluation | 32.86 | 37.65 | +4.79 | +14.58% |
| Overall Inference | 35.18 | 41.24 | +6.06 | +17.23% |
| Sampling | 9961.96 | 10232.84 | +270.88 | +2.72% |

@Gadflyii (Author)

Let me know if you have any questions.

@slaren (Member) commented Sep 28, 2025

This should already be possible with the more generic command line option -nr, --no-repack.

Nvm that, that option does the opposite. I think the better solution would be to add an option to disable host buffer types in make_cpu_buft_list.

@Gadflyii (Author)

> This should already be possible with the more generic command line option -nr, --no-repack.
>
> Nvm that, that option does the opposite. I think the better solution would be to add an option to disable host buffer types in make_cpu_buft_list.

I have played with that a little, but found I couldn't get it to work as expected. I think that is due to how the extra bufts are implemented in the original AMX PR. Not all of the CPU weights go into the CPU_REPACK / AMX bufts, so I think we need to keep the CPU_Mapped model buffer in addition to the extra bufts CPU_REPACK and AMX?

Is that what you meant?

@slaren (Member) commented Sep 28, 2025

What I mean is adding an option to skip adding the host buffer types here:

// add a host buffer type
// storing the tensors in a host buffer is useful when the processing of large batches
// is offloaded to a GPU device, since it reduces the time spent on data transfers
// generally, this will be done using the first device in the list
// a better approach would be to handle this on a weight-by-weight basis using the offload_op
// function of the device to determine if it would benefit from being stored in a host buffer
for (auto * dev : devices) {
    ggml_backend_buffer_type_t buft = ggml_backend_dev_host_buffer_type(dev);
    if (buft) {
        buft_list.emplace_back(dev, buft);
        break;
    }
}

The reason the extra buffer types don't get used when there is a GPU, is because the host buffer types have higher priority. Alternatively, the option could give repack buffers higher priority, but still keep the host buffer types.
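
A minimal sketch of what that could look like, assuming a hypothetical use_host_buffer flag (the name is illustrative, not the actual patch):

// Illustrative sketch only: wrap the host-buffer step above in a flag. When the
// hypothetical use_host_buffer is false, no host buffer type is added to buft_list,
// so the CPU extra buffer types (AMX / repack) keep their higher priority and are
// selected for the CPU-resident weights; weights they do not support still fall
// through to the plain CPU buffer type at the end of the list.
if (use_host_buffer) {
    for (auto * dev : devices) {
        ggml_backend_buffer_type_t buft = ggml_backend_dev_host_buffer_type(dev);
        if (buft) {
            buft_list.emplace_back(dev, buft);
            break;
        }
    }
}

The alternative mentioned above would instead keep this block but insert the repack/AMX entries ahead of it in buft_list.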

@Gadflyii (Author)

> What I mean is adding an option to skip adding the host buffer types here: […]
>
> The reason the extra buffer types don't get used when there is a GPU, is because the host buffer types have higher priority. Alternatively, the option could give repack buffers higher priority, but still keep the host buffer types.

I will make the change and update the PR

@Gadflyii (Author)

@slaren any feedback on what the "opt-in" switch should be called? I can keep it "--amx", or make it more generic, something like "--xbuffers", in case other extra buffers are added in the future.

@slaren (Member) commented Sep 29, 2025

I am not sure what the opt-in switch would be. What I am proposing is a flag to disable host buffer types, and it should be called something like --no-host.
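
If the flag lands under that name, usage would just be the existing command line with the flag appended (a sketch; model path and layer split are placeholders):

llama-cli -m /path/to/model.gguf --n-gpu-layers 10 --cpu-moe --no-host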

@Gadflyii (Author)

@slaren all changes have been made.

All feedback is welcome, and thank you for all your help.
