🐛 Bug Description
I am not able to use the GPU with the --gpu-backend switch on 1.5.6 (self-compiled).
🔄 Steps to Reproduce
- Compile using cargo build --release --no-default-features --features huggingface,llama-opencl,llama-vulkan
- shimmy -V
PS C:\Users\DELL\downloads\shimmy\shimmy\target\release> ./shimmy.exe -V
shimmy 1.5.6
- shimmy gpu-info
PS C:\Users\DELL\downloads\shimmy\shimmy\target\release> ./shimmy.exe gpu-info
🖥️ GPU Backend Information
🔧 llama.cpp Backend: Vulkan
📋 Available GPU Features:
❌ CUDA support disabled
✅ Vulkan support enabled
✅ OpenCL support enabled
🍎 MLX Backend: Disabled (compile with --features mlx)
💡 To enable GPU acceleration:
cargo install shimmy --features llama-cuda # NVIDIA CUDA
cargo install shimmy --features llama-vulkan # Cross-platform Vulkan
cargo install shimmy --features llama-opencl # AMD/Intel OpenCL
cargo install shimmy --features gpu # All GPU backends
PS C:\Users\DELL\downloads\shimmy\shimmy\target\release>
- Run shimmy serve --gpu-backend auto (I have also tried vulkan and opencl explicitly); the result is the same.
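To exercise the server and watch which device does the work, I send a small completion request while it is running. This is only a sketch: it assumes shimmy's OpenAI-compatible chat endpoint on the port shown in the log below, and <model-name> is a placeholder for whatever name shimmy lists for the GGUF.
Invoke-RestMethod -Uri http://127.0.0.1:11435/v1/chat/completions -Method Post -ContentType 'application/json' -Body '{"model":"<model-name>","messages":[{"role":"user","content":"Say hello"}]}'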
✅ Expected Behavior
The GPU is used instead of the CPU.
❌ Actual Behavior
The CPU is used (verified via 100% CPU time).
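As a rough cross-check on Windows (assuming the built-in performance counters are available; exact counter names can vary by driver), the CPU/GPU split can be sampled from PowerShell while a request is being generated:
Get-Counter '\Processor(_Total)\% Processor Time' -SampleInterval 2 -MaxSamples 5
Get-Counter '\GPU Engine(*)\Utilization Percentage' -SampleInterval 2 -MaxSamples 5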
📦 Shimmy Version
Latest (main branch)
💻 Operating System
Windows
📥 Installation Method
Built from source (cargo build)
🌍 Environment Details
- GPU:

📋 Logs/Error Messages
PS C:\Users\DELL\downloads\shimmy\shimmy\target\release> ./shimmy.exe serve --gpu-backend auto
🚀 Starting Shimmy server on 127.0.0.1:11435
llama_model_loader: loaded meta data with 27 key-value pairs and 339 tensors from C:\Users\DELL\Downloads\unsloth.DeepSeek-R1-Distil-Qwen-7B\DeepSeek-R1-Distill-Qwen-7B-Q6_K.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = qwen2
llama_model_loader: - kv 1: general.type str = model
llama_model_loader: - kv 2: general.name str = DeepSeek R1 Distill Qwen 7B
llama_model_loader: - kv 3: general.organization str = Deepseek Ai
llama_model_loader: - kv 4: general.basename str = DeepSeek-R1-Distill-Qwen
llama_model_loader: - kv 5: general.size_label str = 7B
llama_model_loader: - kv 6: qwen2.block_count u32 = 28
llama_model_loader: - kv 7: qwen2.context_length u32 = 131072
llama_model_loader: - kv 8: qwen2.embedding_length u32 = 3584
llama_model_loader: - kv 9: qwen2.feed_forward_length u32 = 18944
llama_model_loader: - kv 10: qwen2.attention.head_count u32 = 28
llama_model_loader: - kv 11: qwen2.attention.head_count_kv u32 = 4
llama_model_loader: - kv 12: qwen2.rope.freq_base f32 = 10000.000000
llama_model_loader: - kv 13: qwen2.attention.layer_norm_rms_epsilon f32 = 0.000001
llama_model_loader: - kv 14: tokenizer.ggml.model str = gpt2
llama_model_loader: - kv 15: tokenizer.ggml.pre str = deepseek-r1-qwen
llama_model_loader: - kv 16: tokenizer.ggml.tokens arr[str,152064] = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv 17: tokenizer.ggml.token_type arr[i32,152064] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv 18: tokenizer.ggml.merges arr[str,151387] = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",...
llama_model_loader: - kv 19: tokenizer.ggml.bos_token_id u32 = 151646
llama_model_loader: - kv 20: tokenizer.ggml.eos_token_id u32 = 151643
llama_model_loader: - kv 21: tokenizer.ggml.padding_token_id u32 = 151654
llama_model_loader: - kv 22: tokenizer.ggml.add_bos_token bool = true
llama_model_loader: - kv 23: tokenizer.ggml.add_eos_token bool = false
llama_model_loader: - kv 24: tokenizer.chat_template str = {% if not add_generation_prompt is de...
llama_model_loader: - kv 25: general.quantization_version u32 = 2
llama_model_loader: - kv 26: general.file_type u32 = 18
llama_model_loader: - type f32: 141 tensors
llama_model_loader: - type q6_K: 198 tensors
print_info: file format = GGUF V3 (latest)
print_info: file type = Q6_K
print_info: file size = 5.82 GiB (6.56 BPW)
init_tokenizer: initializing tokenizer for type 2
load: control token: 151661 '<|fim_suffix|>' is not marked as EOG
load: control token: 151647 '<|EOT|>' is not marked as EOG
load: control token: 151654 '<|vision_pad|>' is not marked as EOG
load: control token: 151659 '<|fim_prefix|>' is not marked as EOG
load: control token: 151646 '<|begin▁of▁sentence|>' is not marked as EOG
load: control token: 151643 '<|end▁of▁sentence|>' is not marked as EOG
load: control token: 151644 '<|User|>' is not marked as EOG
load: control token: 151645 '<|Assistant|>' is not marked as EOG
load: control token: 151650 '<|quad_start|>' is not marked as EOG
load: control token: 151651 '<|quad_end|>' is not marked as EOG
load: control token: 151652 '<|vision_start|>' is not marked as EOG
load: control token: 151653 '<|vision_end|>' is not marked as EOG
load: control token: 151655 '<|image_pad|>' is not marked as EOG
load: control token: 151656 '<|video_pad|>' is not marked as EOG
load: control token: 151660 '<|fim_middle|>' is not marked as EOG
load: special_eos_id is not in special_eog_ids - the tokenizer config may be incorrect
load: printing all EOG tokens:
load: - 151643 ('<|end▁of▁sentence|>')
load: - 151662 ('<|fim_pad|>')
load: - 151663 ('<|repo_name|>')
load: - 151664 ('<|file_sep|>')
load: special tokens cache size = 22
load: token to piece cache size = 0.9310 MB
print_info: arch = qwen2
print_info: vocab_only = 0
print_info: n_ctx_train = 131072
print_info: n_embd = 3584
print_info: n_layer = 28
print_info: n_head = 28
print_info: n_head_kv = 4
print_info: n_rot = 128
print_info: n_swa = 0
print_info: is_swa_any = 0
print_info: n_embd_head_k = 128
print_info: n_embd_head_v = 128
print_info: n_gqa = 7
print_info: n_embd_k_gqa = 512
print_info: n_embd_v_gqa = 512
print_info: f_norm_eps = 0.0e+00
print_info: f_norm_rms_eps = 1.0e-06
print_info: f_clamp_kqv = 0.0e+00
print_info: f_max_alibi_bias = 0.0e+00
print_info: f_logit_scale = 0.0e+00
print_info: f_attn_scale = 0.0e+00
print_info: n_ff = 18944
print_info: n_expert = 0
print_info: n_expert_used = 0
print_info: causal attn = 1
print_info: pooling type = -1
print_info: rope type = 2
print_info: rope scaling = linear
print_info: freq_base_train = 10000.0
print_info: freq_scale_train = 1
print_info: n_ctx_orig_yarn = 131072
print_info: rope_finetuned = unknown
print_info: model type = 7B
print_info: model params = 7.62 B
print_info: general.name = DeepSeek R1 Distill Qwen 7B
print_info: vocab type = BPE
print_info: n_vocab = 152064
print_info: n_merges = 151387
print_info: BOS token = 151646 '<|begin▁of▁sentence|>'
print_info: EOS token = 151643 '<|end▁of▁sentence|>'
print_info: EOT token = 151643 '<|end▁of▁sentence|>'
print_info: PAD token = 151654 '<|vision_pad|>'
print_info: LF token = 198 'Ċ'
print_info: FIM PRE token = 151659 '<|fim_prefix|>'
print_info: FIM SUF token = 151661 '<|fim_suffix|>'
print_info: FIM MID token = 151660 '<|fim_middle|>'
print_info: FIM PAD token = 151662 '<|fim_pad|>'
print_info: FIM REP token = 151663 '<|repo_name|>'
print_info: FIM SEP token = 151664 '<|file_sep|>'
print_info: EOG token = 151643 '<|end▁of▁sentence|>'
print_info: EOG token = 151662 '<|fim_pad|>'
print_info: EOG token = 151663 '<|repo_name|>'
print_info: EOG token = 151664 '<|file_sep|>'
print_info: max token length = 256
load_tensors: loading model tensors, this can take a while... (mmap = true)
load_tensors: layer 0 assigned to device CPU, is_swa = 0
load_tensors: layer 1 assigned to device CPU, is_swa = 0
load_tensors: layer 2 assigned to device CPU, is_swa = 0
load_tensors: layer 3 assigned to device CPU, is_swa = 0
load_tensors: layer 4 assigned to device CPU, is_swa = 0
load_tensors: layer 5 assigned to device CPU, is_swa = 0
load_tensors: layer 6 assigned to device CPU, is_swa = 0
load_tensors: layer 7 assigned to device CPU, is_swa = 0
load_tensors: layer 8 assigned to device CPU, is_swa = 0
load_tensors: layer 9 assigned to device CPU, is_swa = 0
load_tensors: layer 10 assigned to device CPU, is_swa = 0
load_tensors: layer 11 assigned to device CPU, is_swa = 0
load_tensors: layer 12 assigned to device CPU, is_swa = 0
load_tensors: layer 13 assigned to device CPU, is_swa = 0
load_tensors: layer 14 assigned to device CPU, is_swa = 0
load_tensors: layer 15 assigned to device CPU, is_swa = 0
load_tensors: layer 16 assigned to device CPU, is_swa = 0
load_tensors: layer 17 assigned to device CPU, is_swa = 0
load_tensors: layer 18 assigned to device CPU, is_swa = 0
load_tensors: layer 19 assigned to device CPU, is_swa = 0
load_tensors: layer 20 assigned to device CPU, is_swa = 0
load_tensors: layer 21 assigned to device CPU, is_swa = 0
load_tensors: layer 22 assigned to device CPU, is_swa = 0
load_tensors: layer 23 assigned to device CPU, is_swa = 0
load_tensors: layer 24 assigned to device CPU, is_swa = 0
load_tensors: layer 25 assigned to device CPU, is_swa = 0
load_tensors: layer 26 assigned to device CPU, is_swa = 0
load_tensors: layer 27 assigned to device CPU, is_swa = 0
load_tensors: layer 28 assigned to device CPU, is_swa = 0
load_tensors: tensor 'token_embd.weight' (q6_K) (and 338 others) cannot be used with preferred buffer type CPU_REPACK, using CPU instead
load_tensors: CPU_Mapped model buffer size = 5958.79 MiB
........................................................................................
llama_context: constructing llama_context
llama_context: n_seq_max = 1
llama_context: n_ctx = 4096
llama_context: n_ctx_per_seq = 4096
llama_context: n_batch = 2048
llama_context: n_ubatch = 512
llama_context: causal_attn = 1
llama_context: flash_attn = 0
llama_context: kv_unified = false
llama_context: freq_base = 10000.0
llama_context: freq_scale = 1
llama_context: n_ctx_per_seq (4096) < n_ctx_train (131072) -- the full capacity of the model will not be utilized
set_abort_callback: call
llama_context: CPU output buffer size = 0.58 MiB
create_memory: n_ctx = 4096 (padded)
llama_kv_cache_unified: layer 0: dev = CPU
llama_kv_cache_unified: layer 1: dev = CPU
llama_kv_cache_unified: layer 2: dev = CPU
llama_kv_cache_unified: layer 3: dev = CPU
llama_kv_cache_unified: layer 4: dev = CPU
llama_kv_cache_unified: layer 5: dev = CPU
llama_kv_cache_unified: layer 6: dev = CPU
llama_kv_cache_unified: layer 7: dev = CPU
llama_kv_cache_unified: layer 8: dev = CPU
llama_kv_cache_unified: layer 9: dev = CPU
llama_kv_cache_unified: layer 10: dev = CPU
llama_kv_cache_unified: layer 11: dev = CPU
llama_kv_cache_unified: layer 12: dev = CPU
llama_kv_cache_unified: layer 13: dev = CPU
llama_kv_cache_unified: layer 14: dev = CPU
llama_kv_cache_unified: layer 15: dev = CPU
llama_kv_cache_unified: layer 16: dev = CPU
llama_kv_cache_unified: layer 17: dev = CPU
llama_kv_cache_unified: layer 18: dev = CPU
llama_kv_cache_unified: layer 19: dev = CPU
llama_kv_cache_unified: layer 20: dev = CPU
llama_kv_cache_unified: layer 21: dev = CPU
llama_kv_cache_unified: layer 22: dev = CPU
llama_kv_cache_unified: layer 23: dev = CPU
llama_kv_cache_unified: layer 24: dev = CPU
llama_kv_cache_unified: layer 25: dev = CPU
llama_kv_cache_unified: layer 26: dev = CPU
llama_kv_cache_unified: layer 27: dev = CPU
llama_kv_cache_unified: CPU KV buffer size = 224.00 MiB
llama_kv_cache_unified: size = 224.00 MiB ( 4096 cells, 28 layers, 1/1 seqs), K (f16): 112.00 MiB, V (f16): 112.00 MiB
llama_context: enumerating backends
llama_context: backend_ptrs.size() = 1
llama_context: max_nodes = 2712
llama_context: worst-case: n_tokens = 512, n_seqs = 1, n_outputs = 0
graph_reserve: reserving a graph for ubatch with n_tokens = 512, n_seqs = 1, n_outputs = 512
graph_reserve: reserving a graph for ubatch with n_tokens = 1, n_seqs = 1, n_outputs = 1
graph_reserve: reserving a graph for ubatch with n_tokens = 512, n_seqs = 1, n_outputs = 512
llama_context: CPU compute buffer size = 304.00 MiB
llama_context: graph nodes = 1070
llama_context: graph splits = 1
📝 Additional Context
I have tried llama.cpp (with Vulkan) directly on Windows and the GPU is used there, so the hardware is confirmed to be working.
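For reference, that check was along these lines (a sketch, not the exact command; llama-cli is llama.cpp's CLI, and -ngl asks it to offload the given number of layers to the GPU):
./llama-cli.exe -m C:\Users\DELL\Downloads\unsloth.DeepSeek-R1-Distil-Qwen-7B\DeepSeek-R1-Distill-Qwen-7B-Q6_K.gguf -ngl 99 -p "Hello"
In that run the GPU was busy, whereas in the shimmy log above every layer and the KV cache are assigned to device CPU and only one backend is enumerated (backend_ptrs.size() = 1).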