Describe the bug
With an Intel Arc B580, when I run llama-cli with gemma-3-12b-it-q4_0.gguf I get about 17 tokens per sec if it is the first model I run after a PC boot.
However, if after a boot I first run some other model (say Qwen3-8B-Q4_K_M.gguf) with llama-cli, exit its shell, and then run gemma-3-12b-it-q4_0.gguf, I only get about 3 tokens per sec.
This symptom persists until I reboot the PC.
How to reproduce
Steps to reproduce the error (a scripted version of steps 5-7 is sketched after the list):
- Boot your PC
- Run `llama-cpp-ipex-llm/llama-cli.exe -m gemma-3-12b-it-q4_0.gguf --n-gpu-layers 50`
- Type `hello`, wait for the response, and exit from the llama-cli shell -> about 17 tokens per sec
- Reboot your PC
- Run `llama-cpp-ipex-llm/llama-cli.exe -m Qwen3-8B-Q4_K_M.gguf --n-gpu-layers 50`
- Type `hello`, wait for the response, and exit from the llama-cli shell -> about 65 tokens per sec
- Run `llama-cpp-ipex-llm/llama-cli.exe -m gemma-3-12b-it-q4_0.gguf --n-gpu-layers 50`
- Type `hello`, wait for the response, and exit from the llama-cli shell -> now only 3 tokens per sec
- Reboot your PC
- Run `llama-cpp-ipex-llm/llama-cli.exe -m gemma-3-12b-it-q4_0.gguf --n-gpu-layers 50`
- Type `hello`, wait for the response, and exit from the llama-cli shell -> again about 17 tokens per sec
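The Qwen-then-Gemma part can also be reproduced non-interactively with something like the batch sketch below. This is only a sketch of my manual steps, assuming the .gguf files are in the current directory and that this ipex-llm build keeps the standard llama-cli `-p` (prompt) and `-n` (number of tokens to generate) options so each run exits on its own and prints its tokens-per-second summary:

```
:: Sketch: run Qwen3 first, then Gemma 3, without rebooting in between.
:: Assumes standard llama-cli -p / -n options are available in the ipex-llm build.
llama-cpp-ipex-llm\llama-cli.exe -m Qwen3-8B-Q4_K_M.gguf --n-gpu-layers 50 -p "hello" -n 64
llama-cpp-ipex-llm\llama-cli.exe -m gemma-3-12b-it-q4_0.gguf --n-gpu-layers 50 -p "hello" -n 64
:: Compare the eval tokens-per-second printed by the second command against a run
:: where gemma-3-12b-it-q4_0.gguf is loaded first after a reboot (~3 t/s vs ~17 t/s here).
```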
Environment information
- OS: Windows 11 24H2
- CPU: Ryzen 5800
- RAM: 32GB
- GPU: Intel Arc B580
- Intel GPU Driver Version: 32.0.101.7026
- llama.cpp version: llama-cpp-ipex-llm-2.3.0b20250729-win downloaded from https://github.com/ipex-llm/ipex-llm/releases/tag/v2.3.0-nightly
Additional context
The same problem also happens with ollama-ipex-llm-2.3.0b20250725-win.
[edit]
With the official llama.cpp build b6509, Vulkan backend (https://github.com/ggml-org/llama.cpp/releases/tag/b6509), I always get about 35 tokens per sec for gemma-3-12b regardless of the loading order.
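The Vulkan figure above was taken the same way as the steps earlier; it could also be measured with llama-bench from that official release. A sketch, assuming llama-bench.exe from the b6509 package is in the current directory and using its standard -m / -ngl options:

```
:: Benchmark gemma-3-12b on the official b6509 Vulkan build; llama-bench prints
:: prompt-processing and token-generation tokens-per-second tables directly.
llama-bench.exe -m gemma-3-12b-it-q4_0.gguf -ngl 50
```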