PS C:\Users\raymo\llama.cpp> .\build\bin\Release\llama-cli.exe -hf ggml-org/gpt-oss-20b-GGUF --ctx-size 0 --jinja -ub 2048 -b 2048 -ngl 99 -fa
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = Intel(R) Arc(TM) 140V GPU (16GB) (Intel Corporation) | uma: 1 | fp16: 1 | bf16: 0 | warp size: 32 | shared memory: 32768 | int dot: 1 | matrix cores: KHR_coopmat
curl_perform_with_retry: HEAD https://huggingface.co/ggml-org/gpt-oss-20b-GGUF/resolve/main/gpt-oss-20b-mxfp4.gguf (attempt 1 of 1)...
common_download_file_single: using cached file: C:\Users\raymo\AppData\Local\llama.cpp\ggml-org_gpt-oss-20b-GGUF_gpt-oss-20b-mxfp4.gguf
build: 6209 (fb22dd07) with MSVC 19.44.35215.0 for x64
main: llama backend init
main: load the model and apply lora adapter, if any
llama_model_load_from_file_impl: using device Vulkan0 (Intel(R) Arc(TM) 140V GPU (16GB)) - 24574 MiB free
llama_model_loader: loaded meta data with 35 key-value pairs and 459 tensors from C:\Users\raymo\AppData\Local\llama.cpp\ggml-org_gpt-oss-20b-GGUF_gpt-oss-20b-mxfp4.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = gpt-oss
llama_model_loader: - kv 1: general.type str = model
llama_model_loader: - kv 2: general.name str = Gpt Oss 20b
llama_model_loader: - kv 3: general.basename str = gpt-oss
llama_model_loader: - kv 4: general.size_label str = 20B
llama_model_loader: - kv 5: general.license str = apache-2.0
llama_model_loader: - kv 6: general.tags arr[str,2] = ["vllm", "text-generation"]
llama_model_loader: - kv 7: gpt-oss.block_count u32 = 24
llama_model_loader: - kv 8: gpt-oss.context_length u32 = 131072
llama_model_loader: - kv 9: gpt-oss.embedding_length u32 = 2880
llama_model_loader: - kv 10: gpt-oss.feed_forward_length u32 = 2880
llama_model_loader: - kv 11: gpt-oss.attention.head_count u32 = 64
llama_model_loader: - kv 12: gpt-oss.attention.head_count_kv u32 = 8
llama_model_loader: - kv 13: gpt-oss.rope.freq_base f32 = 150000.000000
llama_model_loader: - kv 14: gpt-oss.attention.layer_norm_rms_epsilon f32 = 0.000010
llama_model_loader: - kv 15: gpt-oss.expert_count u32 = 32
llama_model_loader: - kv 16: gpt-oss.expert_used_count u32 = 4
llama_model_loader: - kv 17: gpt-oss.attention.key_length u32 = 64
llama_model_loader: - kv 18: gpt-oss.attention.value_length u32 = 64
llama_model_loader: - kv 19: gpt-oss.attention.sliding_window u32 = 128
llama_model_loader: - kv 20: gpt-oss.expert_feed_forward_length u32 = 2880
llama_model_loader: - kv 21: gpt-oss.rope.scaling.type str = yarn
llama_model_loader: - kv 22: gpt-oss.rope.scaling.factor f32 = 32.000000
llama_model_loader: - kv 23: gpt-oss.rope.scaling.original_context_length u32 = 4096
llama_model_loader: - kv 24: tokenizer.ggml.model str = gpt2
llama_model_loader: - kv 25: tokenizer.ggml.pre str = gpt-4o
llama_model_loader: - kv 26: tokenizer.ggml.tokens arr[str,201088] = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv 27: tokenizer.ggml.token_type arr[i32,201088] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv 28: tokenizer.ggml.merges arr[str,446189] = ["Ġ Ġ", "Ġ ĠĠĠ", "ĠĠ ĠĠ", "...
llama_model_loader: - kv 29: tokenizer.ggml.bos_token_id u32 = 199998
llama_model_loader: - kv 30: tokenizer.ggml.eos_token_id u32 = 200002
llama_model_loader: - kv 31: tokenizer.ggml.padding_token_id u32 = 199999
llama_model_loader: - kv 32: tokenizer.chat_template str = {#-\n In addition to the normal input...
llama_model_loader: - kv 33: general.quantization_version u32 = 2
llama_model_loader: - kv 34: general.file_type u32 = 38
llama_model_loader: - type f32: 289 tensors
llama_model_loader: - type q8_0: 98 tensors
llama_model_loader: - type mxfp4: 72 tensors
print_info: file format = GGUF V3 (latest)
print_info: file type = MXFP4 MoE
print_info: file size = 11.27 GiB (4.63 BPW)
load: printing all EOG tokens:
load:   - 199999 ('<|endoftext|>')
load:   - 200002 ('<|return|>')
load:   - 200007 ('<|end|>')
load:   - 200012 ('<|call|>')
load: special_eog_ids contains both '<|return|>' and '<|call|>' tokens, removing '<|end|>' token from EOG list
load: special tokens cache size = 21
load: token to piece cache size = 1.3332 MB
print_info: arch = gpt-oss
print_info: vocab_only = 0
print_info: n_ctx_train = 131072
print_info: n_embd = 2880
print_info: n_layer = 24
print_info: n_head = 64
print_info: n_head_kv = 8
print_info: n_rot = 64
print_info: n_swa = 128
print_info: is_swa_any = 1
print_info: n_embd_head_k = 64
print_info: n_embd_head_v = 64
print_info: n_gqa = 8
print_info: n_embd_k_gqa = 512
print_info: n_embd_v_gqa = 512
print_info: f_norm_eps = 0.0e+00
print_info: f_norm_rms_eps = 1.0e-05
print_info: f_clamp_kqv = 0.0e+00
print_info: f_max_alibi_bias = 0.0e+00
print_info: f_logit_scale = 0.0e+00
print_info: f_attn_scale = 0.0e+00
print_info: n_ff = 2880
print_info: n_expert = 32
print_info: n_expert_used = 4
print_info: causal attn = 1
print_info: pooling type = 0
print_info: rope type = 2
print_info: rope scaling = yarn
print_info: freq_base_train = 150000.0
print_info: freq_scale_train = 0.03125
print_info: n_ctx_orig_yarn = 4096
print_info: rope_finetuned = unknown
print_info: model type = 20B
print_info: model params = 20.91 B
print_info: general.name = Gpt Oss 20b
print_info: n_ff_exp = 2880
print_info: vocab type = BPE
print_info: n_vocab = 201088
print_info: n_merges = 446189
print_info: BOS token = 199998 '<|startoftext|>'
print_info: EOS token = 200002 '<|return|>'
print_info: EOT token = 199999 '<|endoftext|>'
print_info: PAD token = 199999 '<|endoftext|>'
print_info: LF token = 198 'Ċ'
print_info: EOG token = 199999 '<|endoftext|>'
print_info: EOG token = 200002 '<|return|>'
print_info: EOG token = 200012 '<|call|>'
print_info: max token length = 256
load_tensors: loading model tensors, this can take a while... (mmap = true)
load_tensors: offloading 24 repeating layers to GPU
load_tensors: offloading output layer to GPU
load_tensors: offloaded 25/25 layers to GPU
load_tensors: Vulkan0 model buffer size = 10949.35 MiB
load_tensors: CPU_Mapped model buffer size = 586.82 MiB
................................................................................
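A quick aside on the numbers above: several print_info values are derived rather than stored in the GGUF. The GQA figures follow from the attention head counts in the metadata, and the BPW figure is just file size over parameter count. A minimal sanity check in Python (a sketch; the variable names are mine, and GiB is taken as 2^30 bytes):

# Derived attention dimensions, from the gpt-oss.attention.* metadata keys
n_head, n_head_kv, head_dim = 64, 8, 64
n_gqa = n_head // n_head_kv             # 8 query heads share each KV head
n_embd_k_gqa = n_head_kv * head_dim     # per-token K width held in the KV cache
print(n_gqa, n_embd_k_gqa)              # -> 8 512, matching print_info

# Bits per weight, from "file size = 11.27 GiB" and "model params = 20.91 B"
bpw = 11.27 * 2**30 * 8 / 20.91e9
print(round(bpw, 2))                    # -> 4.63, matching the reported BPW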
llama_context: constructing llama_context
llama_context: n_seq_max = 1
llama_context: n_ctx = 131072
llama_context: n_ctx_per_seq = 131072
llama_context: n_batch = 2048
llama_context: n_ubatch = 2048
llama_context: causal_attn = 1
llama_context: flash_attn = 1
llama_context: kv_unified = false
llama_context: freq_base = 150000.0
llama_context: freq_scale = 0.03125
llama_context: Vulkan_Host output buffer size = 0.77 MiB
llama_kv_cache_unified_iswa: creating non-SWA KV cache, size = 131072 cells
llama_kv_cache_unified: Vulkan0 KV buffer size = 3072.00 MiB
llama_kv_cache_unified: size = 3072.00 MiB (131072 cells, 12 layers, 1/1 seqs), K (f16): 1536.00 MiB, V (f16): 1536.00 MiB
llama_kv_cache_unified_iswa: creating SWA KV cache, size = 2304 cells
llama_kv_cache_unified: Vulkan0 KV buffer size = 54.00 MiB
llama_kv_cache_unified: size = 54.00 MiB ( 2304 cells, 12 layers, 1/1 seqs), K (f16): 27.00 MiB, V (f16): 27.00 MiB
llama_context: Vulkan0 compute buffer size = 1672.08 MiB
llama_context: Vulkan_Host compute buffer size = 1064.59 MiB
llama_context: graph nodes = 1352
llama_context: graph splits = 2
common_init_from_params: added <|endoftext|> logit bias = -inf
common_init_from_params: added <|return|> logit bias = -inf
common_init_from_params: added <|call|> logit bias = -inf
common_init_from_params: setting dry_penalty_last_n to ctx_size = 131072
common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
main: llama threadpool init, n_threads = 8
main: chat template is available, enabling conversation mode (disable it with -no-cnv)
main: chat template example:
<|start|>system<|message|>You are ChatGPT, a large language model trained by OpenAI.
Knowledge cutoff: 2024-06
Current date: 2025-08-19

Reasoning: medium

# Valid channels: analysis, commentary, final. Channel must be included for every message.<|end|><|start|>developer<|message|># Instructions

You are a helpful assistant

<|end|><|start|>user<|message|>Hello<|end|><|start|>assistant<|channel|>final<|message|>Hi there<|end|><|start|>user<|message|>How are you?<|end|><|start|>assistant

system_info: n_threads = 8 (n_threads_batch = 8) / 8 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 |

main: interactive mode on.
sampler seed: 222415949
sampler params:
        repeat_last_n = 64, repeat_penalty = 1.000, frequency_penalty = 0.000, presence_penalty = 0.000
        dry_multiplier = 0.000, dry_base = 1.750, dry_allowed_length = 2, dry_penalty_last_n = 131072
        top_k = 40, top_p = 0.950, min_p = 0.050, xtc_probability = 0.000, xtc_threshold = 0.100, typical_p = 1.000, top_n_sigma = -1.000, temp = 0.800
        mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000
sampler chain: logits -> logit-bias -> penalties -> dry -> top-n-sigma -> top-k -> typical -> top-p -> min-p -> xtc -> temp-ext -> dist
generate: n_ctx = 131072, n_batch = 2048, n_predict = -1, n_keep = 0

== Running in interactive mode. ==
 - Press Ctrl+C to interject at any time.
 - Press Return to return control to the AI.
 - To return control without starting a new line, end your input with '/'.
 - If you want to submit another line, end your input with '\'.
 - Not using system message. To change it, set a different value via -sys PROMPT

> how are you doing sir?
<|channel|>analysis<|message|>User says "how are you doing sir?" It's a polite greeting. Should respond politely. Also ask how user is doing.<|end|><|start|>assistant<|channel|>final<|message|>I’m doing well, thank you!
How about you—how’s your day going?
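For reference, the two KV-cache allocations in the log can be reproduced from the dimensions above: each layer stores 512 K values and 512 V values per cell in f16 (2 bytes each), and the iSWA split assigns 12 full-attention layers to the 131072-cell cache and 12 sliding-window layers to the 2304-cell cache. A minimal check in Python (my own helper, not llama.cpp code):

def kv_mib(cells, layers, width=512, bytes_per_val=2):
    # Size of the K (or the V) half of a cache, in MiB
    return cells * layers * width * bytes_per_val / 2**20

print(kv_mib(131072, 12))  # -> 1536.0 MiB each for K and V, i.e. the 3072 MiB non-SWA cache
print(kv_mib(2304, 12))    # -> 27.0 MiB each for K and V, i.e. the 54 MiB SWA cache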