
Conversation

jambayk
Contributor

@jambayk jambayk commented Jul 22, 2025

  • Support the new "olive" quant type.
  • Weight and zero-point packings are the same as GPTQ; there is no g_idx.
  • Similar to the k_quant mixed-precision int4_algo, select matmuls can be quantized to 8 bits.
    • Currently, we ensure that the q_proj, k_proj, and v_proj matmuls use the same configuration (bits and group_size) so that they can be merged without issues.
  • The modules are generalized to remove the requirement that all matmuls in a layer have the same bits and group_size.
  • quant_weight and dequant_weight support the no-g_idx case by using repeat_interleave; otherwise, we would have to create a trivial g_idx the way the quark model does (see the sketch after this list).
  • pack_ort_format supports 8-bit packing.
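
A minimal sketch of the no-g_idx dequantization path, assuming GPTQ-style affine quantization with contiguous groups along the input dimension (names and shapes here are illustrative, not the actual helpers in this PR):

import torch

def dequant_weight(qweight, scales, zeros, group_size):
    # qweight: (in_features, out_features) integer weights
    # scales, zeros: (in_features // group_size, out_features) per-group parameters
    # With no g_idx, group membership is implied by row order, so the per-group
    # scales and zero points can be expanded with repeat_interleave instead of
    # being gathered through a trivial g_idx.
    scales = scales.repeat_interleave(group_size, dim=0)
    zeros = zeros.repeat_interleave(group_size, dim=0)
    return (qweight.float() - zeros.float()) * scales

The same result could be obtained by building g_idx = torch.arange(in_features) // group_size and indexing the per-group tensors with it, which is what a trivial g_idx (as in the quark model) amounts to.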

@jambayk jambayk requested a review from kunal-vaishnavi July 22, 2025 22:38
@natke natke added the 0.9.0 label Jul 24, 2025
@kunal-vaishnavi
Contributor

Can we uncomment some of the CI models to test the quantized PyTorch to quantized ONNX path?

def get_model_paths():
    # TODO: Uncomment the following models as needed in the CI pipeline.
    hf_paths = {
        "phi-2": "microsoft/phi-2",
        # "olmo": "amd/AMD-OLMo-1B-SFT-DPO",
        "qwen-2.5": "Qwen/Qwen2.5-0.5B",
        # "phi-3.5": "microsoft/Phi-3.5-mini-instruct",
        # "llama-3.2": "meta-llama/Llama-3.2-1B-instruct",
        # "granite-3.0": "ibm-granite/granite-3.0-2b-instruct",
    }
    ci_data_path = os.path.join(get_ci_data_path(), "pytorch")
    if not os.path.exists(ci_data_path):
        return {}, hf_paths
    # Note: If a model has over 4B parameters, please add a quantized version
    # to `ci_paths` instead of `hf_paths` to reduce file size and testing time.
    ci_paths = {
        # "llama-2": os.path.join(ci_data_path, "Llama-2-7B-Chat-GPTQ"),
        # "llama-3": os.path.join(ci_data_path, "Meta-Llama-3-8B-AWQ"),
        # "mistral-v0.2": os.path.join(ci_data_path, "Mistral-7B-Instruct-v0.2-GPTQ"),
        "phi-2": os.path.join(ci_data_path, "phi2"),
        # "gemma-2b": os.path.join(ci_data_path, "gemma-1.1-2b-it"),
        # "gemma-7b": os.path.join(ci_data_path, "gemma-7b-it-awq"),
        # "phi-3-mini": os.path.join(ci_data_path, "phi3-mini-128k-instruct"),
        # "gemma-2-2b": os.path.join(ci_data_path, "gemma-2-2b-it"),
        # "llama-3.2": os.path.join(ci_data_path, "llama-3.2b-1b-instruct"),
        "qwen-2.5": os.path.join(ci_data_path, "qwen2.5-0.5b-instruct"),
        # "nemotron-mini": os.path.join(ci_data_path, "nemotron-mini-4b"),
    }
    return ci_paths, hf_paths

@jambayk
Contributor Author

jambayk commented Jul 28, 2025

Can we uncomment some of the CI models to test the quantized PyTorch to quantized ONNX path?


I can do it as part of this PR but not sure which ones to uncomment.

@jambayk jambayk requested a review from kunal-vaishnavi July 28, 2025 19:15
@kunal-vaishnavi
Contributor

Can we uncomment some of the CI models to test the quantized PyTorch to quantized ONNX path?


I can do it as part of this PR but not sure which ones to uncomment.

You can uncomment the models with GPTQ or AWQ in the name since they will go through quantized_model.py.
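
Concretely, that would mean re-enabling the GPTQ/AWQ entries in ci_paths, roughly as below (paths copied from the snippet above; whether the files are actually present in the CI data share is assumed here):

ci_paths = {
    "llama-2": os.path.join(ci_data_path, "Llama-2-7B-Chat-GPTQ"),
    "llama-3": os.path.join(ci_data_path, "Meta-Llama-3-8B-AWQ"),
    "mistral-v0.2": os.path.join(ci_data_path, "Mistral-7B-Instruct-v0.2-GPTQ"),
    "gemma-7b": os.path.join(ci_data_path, "gemma-7b-it-awq"),
    "phi-2": os.path.join(ci_data_path, "phi2"),
    "qwen-2.5": os.path.join(ci_data_path, "qwen2.5-0.5b-instruct"),
    # ... remaining commented entries unchanged ...
}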

@kunal-vaishnavi kunal-vaishnavi enabled auto-merge (squash) July 30, 2025 17:22
@kunal-vaishnavi kunal-vaishnavi merged commit 4aee929 into main Jul 30, 2025
14 of 16 checks passed
@kunal-vaishnavi kunal-vaishnavi deleted the jambayk/olive-quant branch July 30, 2025 17:33