Hi, I initially intended to use torch.compile to speed up inference, but an error occurred:
xe_linear.forward_new
from user code:
File "D:\miniconda3\envs\compile\Lib\site-packages\transformers\models\qwen2_5_omni\modeling_qwen2_5_omni.py", line 1838, in _forward_native
hidden_states, self_attn_weights, present_key_value = self.self_attn(
File "D:\miniconda3\envs\compile\Lib\site-packages\torch\nn\modules\module.py", line 1784, in _call_impl
return forward_call(*args, **kwargs)
File "D:\miniconda3\envs\compile\Lib\site-packages\ipex_llm\transformers\models\qwen2_5_omni.py", line 251, in qwen2_5_omni_attention_forward
qkv = self.qkv_proj(hidden_states)
File "D:\miniconda3\envs\compile\Lib\site-packages\ipex_llm\transformers\low_bit_linear.py", line 711, in forward
result = xe_linear.forward_new(x_2d, w, self.qtype, self.out_len)
Set TORCHDYNAMO_VERBOSE=1 for the internal stack trace (please do this especially if you're reporting a bug to PyTorch). For even more developer context, set TORCH_LOGS="+dynamo"
When I try to use F.linear instead of self.qkv_proj, I get this error:
RuntimeError: self and mat2 must have the same dtype, but got BFloat16 and Byte
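For reference, my replacement attempt looked roughly like the sketch below (not the exact patch). Since self.qkv_proj is ipex-llm's low-bit linear module, its stored weight is the packed uint8 buffer rather than a bf16 matrix, which is presumably why the dtypes mismatch:

```python
import torch.nn.functional as F

# Sketch of my replacement attempt (not the exact patch).
# self.qkv_proj.weight here is the packed low-bit (uint8) buffer,
# so the matmul sees BFloat16 x Byte and raises the RuntimeError above.
qkv = F.linear(hidden_states, self.qkv_proj.weight)
```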
So I know there are two operations inside xe_linear.forward_new: dequantize + GEMM.
Therefore, I want to implement a custom operator for the dequantization.
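Conceptually, what I want at the call site is something like the sketch below. `dequantize_woq_int4` is my own hypothetical helper (sketched in full at the end of this issue); the `out_len` attribute name is taken from the traceback into low_bit_linear.py:

```python
import torch.nn.functional as F

def qkv_proj_compile_friendly(self, hidden_states):
    # Hypothetical replacement for xe_linear.forward_new:
    #   1) dequantize the packed woq_int4 weight back to a float matrix,
    #   2) run a plain GEMM that torch.compile / Dynamo can trace.
    w = dequantize_woq_int4(
        self.qkv_proj.weight,                # packed uint8 buffer
        out_features=self.qkv_proj.out_len,  # name taken from low_bit_linear.py traceback
        in_features=hidden_states.shape[-1],
    )
    return F.linear(hidden_states, w.to(hidden_states.dtype))
```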
Here are my questions:
- When low_bit='woq_int4', how can I get the scale parameter? The model weights contain the quantized values, and as I understand it each uint8 stores two int4 weights, with a scale after every 64 weights (32 uint8 bytes), but I'm not sure how to extract that scale. Since QK=64 and block_size_in_bytes=34, maybe each 34-byte block is 64 int4 weights plus one fp16 scale (see the sketch at the end of this issue for how I imagine the layout).
- Is the quantization range [-8, 7] or [-7, 7]?
- What is the packing sequence, i.e. in what order are the int4 values packed into the bytes (low nibble first or high nibble first)?
Can you give me some advice?
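To make the questions concrete, here is a sketch of how I currently imagine the dequantization, assuming a ggml-Q4_0-style block layout with QK=64: a 2-byte fp16 scale followed by 32 packed bytes, nibbles stored unsigned with an offset of 8 (i.e. range [-8, 7]), low nibbles first. Every one of these assumptions (scale position, offset/range, nibble order) is exactly what I'd like you to confirm or correct:

```python
import torch

QK = 64           # weights per quantization block (assumption)
BLOCK_BYTES = 34  # 2-byte fp16 scale + 32 packed bytes (assumption)

def dequantize_woq_int4(packed: torch.Tensor, out_features: int, in_features: int) -> torch.Tensor:
    """Sketch only: unpack a flat uint8 buffer into a float16 weight matrix.

    Assumed per-block layout (QK=64, 34 bytes):
      bytes [0:2)  -> fp16 scale
      bytes [2:34) -> 32 uint8 values, each holding two int4 weights
    with nibbles stored as unsigned values minus an offset of 8 ([-8, 7]),
    and the low nibbles giving the first 32 weights of the block (a guess).
    """
    blocks = packed.reshape(-1, BLOCK_BYTES)                   # (n_blocks, 34)
    scales = blocks[:, :2].contiguous().view(torch.float16)    # (n_blocks, 1) fp16 scales
    qs = blocks[:, 2:]                                         # (n_blocks, 32) packed uint8
    lo = (qs & 0x0F).to(torch.int16) - 8                       # low nibbles  (first half?)
    hi = (qs >> 4).to(torch.int16) - 8                         # high nibbles (second half?)
    w = torch.cat([lo, hi], dim=1).to(torch.float16) * scales  # (n_blocks, 64)
    return w.reshape(out_features, in_features)
```

If this layout is right, the projection itself then becomes a plain `F.linear(hidden_states, w.to(hidden_states.dtype))`, which Dynamo should be able to trace.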