
llama.cpp becomes significantly slower after loading two different models #13307

@lucidfrontier45

Description


Describe the bug
With an Intel Arc B580, when I use llama-cli to run gemma-3-12b-it-q4_0.gguf I get about 17 tokens per sec if it is the first model I run after a PC boot.

However, if after a PC boot I first run some other model (say Qwen3-8B-Q4_K_M.gguf) with llama-cli, exit its shell, and then run gemma-3-12b-it-q4_0.gguf, I only get about 3 tokens per sec.

The slowdown persists until I reboot the PC.

How to reproduce
Steps to reproduce the error (a llama-bench sketch for capturing these numbers non-interactively follows the list):

  1. Boot your PC
  2. llama-cpp-ipex-llm/llama-cli.exe -m gemma-3-12b-it-q4_0.gguf --n-gpu-layers 50
  3. Type "hello", wait for the response, and exit the llama-cli shell -> about 17 tokens per sec
  4. Reboot your PC
  5. llama-cpp-ipex-llm/llama-cli.exe -m Qwen3-8B-Q4_K_M.gguf --n-gpu-layers 50
  6. Type "hello", wait for the response, and exit the llama-cli shell -> about 65 tokens per sec
  7. llama-cpp-ipex-llm/llama-cli.exe -m gemma-3-12b-it-q4_0.gguf --n-gpu-layers 50
  8. Type "hello", wait for the response, and exit the llama-cli shell -> now only about 3 tokens per sec
  9. Reboot your PC
  10. llama-cpp-ipex-llm/llama-cli.exe -m gemma-3-12b-it-q4_0.gguf --n-gpu-layers 50
  11. Type "hello", wait for the response, and exit the llama-cli shell -> again about 17 tokens per sec
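
As a rough, non-interactive way to capture these numbers, the same sequence can be driven with llama-bench. This sketch assumes the ipex-llm package ships llama-bench.exe next to llama-cli.exe (it is part of upstream llama.cpp, but I have not verified that it is bundled in this particular build); -ngl is the short form of --n-gpu-layers, and -p/-n set the prompt and generation lengths. If the slowdown comes from driver/GPU state rather than from llama-cli itself, the second gemma-3 run should show the same drop, in the tg rows of the output table.

    # fresh boot: gemma-3 alone (about 17 t/s on my machine)
    llama-cpp-ipex-llm/llama-bench.exe -m gemma-3-12b-it-q4_0.gguf -ngl 50 -p 64 -n 128

    # after another reboot: Qwen3 first, then gemma-3 (gemma-3 drops to about 3 t/s)
    llama-cpp-ipex-llm/llama-bench.exe -m Qwen3-8B-Q4_K_M.gguf -ngl 50 -p 64 -n 128
    llama-cpp-ipex-llm/llama-bench.exe -m gemma-3-12b-it-q4_0.gguf -ngl 50 -p 64 -n 128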

Environment information

Additional context
The same problem also happens with ollama-ipex-llm-2.3.0b20250725-win.

[edit]
With the official llama.cpp build b6509 with the Vulkan backend (https://github.com/ggml-org/llama.cpp/releases/tag/b6509) I always get about 35 tokens per sec for gemma-3-12b regardless of the loading order.
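
A back-to-back comparison with the official Vulkan build can be run the same way; the directory name below is just whatever the b6509 release zip unpacks to, so adjust the path accordingly. With this build, gemma-3 should stay at roughly 35 tokens per sec regardless of what ran before it, matching the observation above.

    # official b6509 Vulkan build, same boot, Qwen3 first and gemma-3 second
    llama-b6509-bin-win-vulkan-x64/llama-bench.exe -m Qwen3-8B-Q4_K_M.gguf -ngl 50 -p 64 -n 128
    llama-b6509-bin-win-vulkan-x64/llama-bench.exe -m gemma-3-12b-it-q4_0.gguf -ngl 50 -p 64 -n 128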
