
llama.cpp becomes significantly slower after loading two different models #13307

@lucidfrontier45

Description


Describe the bug
With an Intel Arc B580, when I use llama-cli to run gemma-3-12b-it-q4_0.gguf I get about 17 tokens per sec if it is the first model I run after a PC boot.

However, if after a PC boot I first run some other model (say Qwen3-8B-Q4_K_M.gguf) with llama-cli, exit its shell, and then run gemma-3-12b-it-q4_0.gguf, I only get about 3 tokens per sec.

The slowdown persists until I reboot the PC.

How to reproduce
Steps to reproduce the error (a llama-bench sketch for capturing these numbers non-interactively follows the list):

  1. Boot your PC
  2. llama-cpp-ipex-llm/llama-cli.exe -m gemma-3-12b-it-q4_0.gguf --n-gpu-layers 50
  3. Type "hello", wait for the response, and exit the llama-cli shell -> about 17 tokens per sec
  4. Reboot your PC
  5. llama-cpp-ipex-llm/llama-cli.exe -m Qwen3-8B-Q4_K_M.gguf --n-gpu-layers 50
  6. Type "hello", wait for the response, and exit the llama-cli shell -> about 65 tokens per sec
  7. llama-cpp-ipex-llm/llama-cli.exe -m gemma-3-12b-it-q4_0.gguf --n-gpu-layers 50
  8. Type "hello", wait for the response, and exit the llama-cli shell -> now only about 3 tokens per sec
  9. Reboot your PC
  10. llama-cpp-ipex-llm/llama-cli.exe -m gemma-3-12b-it-q4_0.gguf --n-gpu-layers 50
  11. Type "hello", wait for the response, and exit the llama-cli shell -> again about 17 tokens per sec
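
As a rough, non-interactive way to capture these numbers, the same sequence can be driven with llama-bench. This sketch assumes the ipex-llm package ships llama-bench.exe next to llama-cli.exe (it is part of upstream llama.cpp, but I have not verified that it is bundled in this particular build); -ngl is the short form of --n-gpu-layers, and -p/-n set the prompt and generation lengths. If the slowdown comes from driver/GPU state rather than from llama-cli itself, the second gemma-3 run should show the same drop, in the tg rows of the output table.

    # fresh boot: gemma-3 alone (about 17 t/s on my machine)
    llama-cpp-ipex-llm/llama-bench.exe -m gemma-3-12b-it-q4_0.gguf -ngl 50 -p 64 -n 128

    # after another reboot: Qwen3 first, then gemma-3 (gemma-3 drops to about 3 t/s)
    llama-cpp-ipex-llm/llama-bench.exe -m Qwen3-8B-Q4_K_M.gguf -ngl 50 -p 64 -n 128
    llama-cpp-ipex-llm/llama-bench.exe -m gemma-3-12b-it-q4_0.gguf -ngl 50 -p 64 -n 128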

Environment information

Additional context
The same problem also happens with ollama-ipex-llm-2.3.0b20250725-win.

[edit]
With the official llama.cpp build b6509 with the Vulkan backend (https://github.com/ggml-org/llama.cpp/releases/tag/b6509) I always get about 35 tokens per sec for gemma-3-12b regardless of the loading order.
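
A back-to-back comparison with the official Vulkan build can be run the same way; the directory name below is just whatever the b6509 release zip unpacks to, so adjust the path accordingly. With this build, gemma-3 should stay at roughly 35 tokens per sec regardless of what ran before it, matching the observation above.

    # official b6509 Vulkan build, same boot, Qwen3 first and gemma-3 second
    llama-b6509-bin-win-vulkan-x64/llama-bench.exe -m Qwen3-8B-Q4_K_M.gguf -ngl 50 -p 64 -n 128
    llama-b6509-bin-win-vulkan-x64/llama-bench.exe -m gemma-3-12b-it-q4_0.gguf -ngl 50 -p 64 -n 128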
