
Commit addb552

Update PyTorch wheel URL and LLM README (#5748)
Update the PyTorch wheel URL and version for the 2.8 official release, add the Qwen3 models to the guide, and remove trailing spaces.

Co-authored-by: ZhangJianyu <[email protected]>
1 parent f7b49c9 commit addb552


3 files changed: +37 -37 lines


dependency_version.json

Lines changed: 7 additions & 7 deletions
@@ -4,24 +4,24 @@
     "min-version": "12.3.0"
   },
   "pytorch": {
-    "index-url": "https://download.pytorch.org/whl/nightly/xpu",
+    "index-url": "https://download.pytorch.org/whl/xpu",
     "version": {
-      "linux": "2.8.0.dev20250618+xpu",
-      "windows": "2.8.0.dev20250618+xpu"
+      "linux": "2.8.0",
+      "windows": "2.8.0"
     },
     "commit": "main"
   },
   "torchaudio": {
     "version": {
-      "linux": "2.8.0.dev20250618+xpu",
-      "windows": "2.8.0.dev20250619+xpu"
+      "linux": "2.8.0",
+      "windows": "2.8.0"
     },
     "commit": "main"
   },
   "torchvision": {
     "version": {
-      "linux": "0.23.0.dev20250618+xpu",
-      "windows": "0.23.0.dev20250619+xpu"
+      "linux": "0.23.0",
+      "windows": "0.23.0"
     },
     "commit": "main"
   },
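
With this change the project pins the PyTorch 2.8.0 official XPU wheels instead of dated nightly builds. As a hedged illustration (not part of the commit), the sketch below reads `dependency_version.json` and prints a matching `pip install` command; the relative file path and the `--index-url` form of the command are assumptions for the example, not something the repo's tooling necessarily does.

```python
# Illustrative only: derive a pip command from the JSON structure shown in the diff above.
import json
import platform

with open("dependency_version.json") as f:
    deps = json.load(f)

os_key = "windows" if platform.system() == "Windows" else "linux"

torch_ver = deps["pytorch"]["version"][os_key]            # "2.8.0" after this commit
torchvision_ver = deps["torchvision"]["version"][os_key]  # "0.23.0"
torchaudio_ver = deps["torchaudio"]["version"][os_key]    # "2.8.0"
index_url = deps["pytorch"]["index-url"]                  # https://download.pytorch.org/whl/xpu

print(
    f"pip install torch=={torch_ver} torchvision=={torchvision_ver} "
    f"torchaudio=={torchaudio_ver} --index-url {index_url}"
)
```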

examples/gpu/llm/README.md

Lines changed: 4 additions & 4 deletions
@@ -2,12 +2,12 @@
 
 Here you can find examples for large language models (LLM) text generation. These scripts:
 
-> [!NOTE] 
-> New Llama models like Llama3.2-1B, Llama3.2-3B and Llama3.3-7B are also supported from release v2.8.10+xpu.
+> [!NOTE]
+> New models like Qwen3-4B and Qwen3-8B are also supported from release v2.8.10+xpu.
 
 - Include both inference/finetuning(lora)/bitsandbytes(qlora-finetuning).
 - Include both single instance and distributed (DeepSpeed) use cases for FP16 optimization.
-- Support Llama, GPT-J, Qwen, OPT, Bloom model families and some other models such as Baichuan2-13B and Phi3-mini. 
+- Support Llama, GPT-J, Qwen, OPT, Bloom model families and some other models such as Baichuan2-13B and Phi3-mini.
 - Cover model generation inference with low precision cases for different models with best performance and accuracy (fp16 AMP and weight only quantization)
 
 ## Environment Setup
@@ -124,7 +124,7 @@ where <br />
 
 
 <br />
-
+
 ## How To Run LLM with ipex.llm
 
 Inference and fine-tuning are supported in individual directories.
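
Since the note now points readers at Qwen3 checkpoints, here is a minimal FP16 sketch of the `ipex.llm` flow the README describes, assuming an XPU-enabled PyTorch/IPEX install. The model ID, prompt, and generation settings are illustrative, the repo's own example scripts remain the reference path, and Qwen3 may require a newer Transformers release than the 4.48.3 pinned in the inference README below.

```python
# Minimal FP16 sketch (not from the commit): run one of the newly listed Qwen3
# checkpoints through ipex.llm.optimize on an XPU device.
import torch
import intel_extension_for_pytorch as ipex
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Qwen/Qwen3-8B"  # illustrative pick from the updated support list
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.float16)
model = model.eval().to("xpu")

# Apply the LLM-specific optimizations the README refers to (FP16 path, no quantization).
model = ipex.llm.optimize(model, dtype=torch.float16, device="xpu", inplace=True)

prompt = "What does weight-only quantization do?"
inputs = tokenizer(prompt, return_tensors="pt").to("xpu")
with torch.no_grad():
    output_ids = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```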

examples/gpu/llm/inference/README.md

Lines changed: 26 additions & 26 deletions
@@ -2,7 +2,7 @@
 
 Here you can find the inference examples for large language models (LLM) text generation. These scripts:
 
-- Support Llama, GPT-J, Qwen, OPT, Bloom model families and some other Chinese models such as GLM4-9B, Baichuan2-13B and Phi3-mini. 
+- Support Llama, GPT-J, Qwen, OPT, Bloom model families and some other Chinese models such as GLM4-9B, Baichuan2-13B and Phi3-mini.
 - Include both single instance and distributed (DeepSpeed) use cases for FP16 optimization.
 - Cover model generation inference with low precision cases for different models with best performance and accuracy (fp16 AMP and weight only quantization)
 
@@ -11,14 +11,14 @@ Here you can find the inference examples for large language models (LLM) text ge
 
 Currently, only support Transformers 4.48.3. Support for newer versions of Transformers and more models will be available in the future.
 
-| MODEL FAMILY | Verified < MODEL ID > (Huggingface hub)| FP16 | Weight only quantization INT4 | Optimized on Intel® Data Center GPU Max Series (1550/1100) | Optimized on Intel® Core™ Ultra Processors with Intel® Arc™ Graphics | Optimized on Intel® Arc™ B-Series Graphics (B580) | 
+| MODEL FAMILY | Verified < MODEL ID > (Huggingface hub)| FP16 | Weight only quantization INT4 | Optimized on Intel® Data Center GPU Max Series (1550/1100) | Optimized on Intel® Core™ Ultra Processors with Intel® Arc™ Graphics | Optimized on Intel® Arc™ B-Series Graphics (B580) |
 |---|:---:|:---:|:---:|:---:|:---:|:---:|
 |Llama 2| "meta-llama/Llama-2-7b-hf", "meta-llama/Llama-2-13b-hf", "meta-llama/Llama-2-70b-hf" |||||$✅^1$|
 |Llama 3| "meta-llama/Meta-Llama-3-8B", "meta-llama/Meta-Llama-3-70B", "meta-llama/Llama-3.2-1B", "meta-llama/Llama-3.2-3B", "meta-llama/Llama-3.3-70B-Instruct" |||||$✅^2$|
 |Phi-3 mini| "microsoft/Phi-3-mini-128k-instruct", "microsoft/Phi-3-mini-4k-instruct", "microsoft/Phi-3.5-mini-instruct" |||||$✅^3$|
 |Mistral | "mistralai/Mistral-7B-Instruct-v0.2" ||||| |
 |GPT-J| "EleutherAI/gpt-j-6b" ||||| |
-|Qwen|"Qwen/Qwen2-7B", "Qwen/Qwen2-7B-Instruct", "Qwen/Qwen2.5-7B-Instruct" ||||| |
+|Qwen|"Qwen/Qwen2-7B", "Qwen/Qwen2-7B-Instruct", "Qwen/Qwen2.5-7B-Instruct", "Qwen/Qwen3-4B", "Qwen/Qwen3-8B" ||||| |
 |OPT|"facebook/opt-6.7b", "facebook/opt-30b"|| || |
 |Bloom|"bigscience/bloom-7b1", "bigscience/bloom"|| || |
 |GLM4-9B|"THUDM/glm-4-9b"|| || |
@@ -27,16 +27,16 @@ Currently, only support Transformers 4.48.3. Support for newer versions of Trans
 - ✅ signifies that it is supported.
 
 - A blank signifies that it is not supported yet.
-
+
 - 1: signifies that Llama-2-7b-hf is verified.
-
+
 - 2: signifies that Meta-Llama-3-8B is verified.
-
+
 - 3: signifies that Phi-3-mini-4k-instruct is verified.
 
 
 
-**Note**: The verified models mentioned above (including other models in the same model family, like "codellama/CodeLlama-7b-hf" from LLAMA family) are well-supported with all optimizations like indirect access KV cache and fused ROPE. For other LLM families, we are actively working to implement these optimizations, which will be reflected in the expanded model list above. 
+**Note**: The verified models mentioned above (including other models in the same model family, like "codellama/CodeLlama-7b-hf" from LLAMA family) are well-supported with all optimizations like indirect access KV cache and fused ROPE. For other LLM families, we are actively working to implement these optimizations, which will be reflected in the expanded model list above.
 
 ## Supported Platforms
 
@@ -148,9 +148,9 @@ tokenizer = AutoTokenizer.from_pretrained(model_id)
 
 # Define the quantization configuration
 woq_quantization_config = RtnConfig(
-    compute_dtype="fp16", 
-    weight_dtype="int4_fullrange", 
-    scale_dtype="fp16", 
+    compute_dtype="fp16",
+    weight_dtype="int4_fullrange",
+    scale_dtype="fp16",
     group_size=64
 )
 # Load the model and apply quantization
@@ -167,9 +167,9 @@ model = model.to(memory_format=torch.channels_last)
 
 # Optimize the model with Intel Extension for PyTorch (IPEX)
 model = ipex.llm.optimize(
-    model.eval(), 
-    device="xpu", 
-    inplace=True, 
+    model.eval(),
+    device="xpu",
+    inplace=True,
     quantization_config=woq_quantization_config
 )
 
@@ -226,19 +226,19 @@ if os.path.exists(woq_checkpoint_path):
     print("Directly loading already quantized model")
     # Load the already quantized model
     model = AutoModelForCausalLM.from_pretrained(
-        woq_checkpoint_path, 
-        trust_remote_code=use_hf_code, 
-        device_map="xpu", 
+        woq_checkpoint_path,
+        trust_remote_code=use_hf_code,
+        device_map="xpu",
         torch_dtype=torch.float16
     )
     model = model.to(memory_format=torch.channels_last)
     woq_quantization_config = getattr(model, "quantization_config", None)
 else:
     # Define the quantization configuration
     woq_quantization_config = RtnConfig(
-        compute_dtype="fp16", 
-        weight_dtype="int4_fullrange", 
-        scale_dtype="fp16", 
+        compute_dtype="fp16",
+        weight_dtype="int4_fullrange",
+        scale_dtype="fp16",
         group_size=64
     )
     # Load the model and apply quantization
@@ -258,9 +258,9 @@ model = model.to(memory_format=torch.channels_last)
 
 # Optimize the model with Intel Extension for PyTorch (IPEX)
 model = ipex.llm.optimize(
-    model.eval(), 
-    device="xpu", 
-    inplace=True, 
+    model.eval(),
+    device="xpu",
+    inplace=True,
     quantization_config=woq_quantization_config
 )
 
@@ -316,9 +316,9 @@ if os.path.exists(woq_checkpoint_path):
     print("Directly loading already quantized model")
     # Load the quantized model
     model = AutoModelForCausalLM.from_pretrained(
-        woq_checkpoint_path, 
-        trust_remote_code=use_hf_code, 
-        device_map="xpu", 
+        woq_checkpoint_path,
+        trust_remote_code=use_hf_code,
+        device_map="xpu",
         torch_dtype=torch.float16
     )
     model = model.to(memory_format=torch.channels_last)
@@ -378,7 +378,7 @@ mpirun -np 2 --prepend-rank python -u run_generation_with_deepspeed.py --benchma
 
 **Note**: Unset the variable before running.
 ```bash
-# Adding this variable to run multi-tile cases might cause an Out Of Memory (OOM) issue. 
+# Adding this variable to run multi-tile cases might cause an Out Of Memory (OOM) issue.
 unset TORCH_LLM_ALLREDUCE
 ```
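
The whitespace-only hunks above touch the README's weight-only-quantization (WOQ) example; assembled into one place, that control flow looks roughly like the sketch below. The `intel_extension_for_transformers` import for the quantization-aware `AutoModelForCausalLM` and `RtnConfig`, the `model_id`, the checkpoint path, and the quantize-branch `from_pretrained` arguments are assumptions inferred from the surrounding README rather than lines in this diff.

```python
# Hedged reconstruction of the WOQ flow whose formatting is fixed above.
import os

import torch
import intel_extension_for_pytorch as ipex
from transformers import AutoTokenizer
# Assumed import path; the full README example defines where RtnConfig and the
# quantization-aware AutoModelForCausalLM actually come from.
from intel_extension_for_transformers.transformers import AutoModelForCausalLM, RtnConfig

model_id = "Qwen/Qwen2.5-7B-Instruct"     # illustrative model from the table
woq_checkpoint_path = "./woq_checkpoint"  # illustrative path
use_hf_code = False                       # trust_remote_code toggle, as in the README

tokenizer = AutoTokenizer.from_pretrained(model_id)

if os.path.exists(woq_checkpoint_path):
    print("Directly loading already quantized model")
    # Load the already quantized model
    model = AutoModelForCausalLM.from_pretrained(
        woq_checkpoint_path,
        trust_remote_code=use_hf_code,
        device_map="xpu",
        torch_dtype=torch.float16,
    )
    model = model.to(memory_format=torch.channels_last)
    woq_quantization_config = getattr(model, "quantization_config", None)
else:
    # Define the quantization configuration (values taken from the diff above)
    woq_quantization_config = RtnConfig(
        compute_dtype="fp16",
        weight_dtype="int4_fullrange",
        scale_dtype="fp16",
        group_size=64,
    )
    # Load the model and apply quantization (argument set assumed, not from the diff)
    model = AutoModelForCausalLM.from_pretrained(
        model_id,
        trust_remote_code=use_hf_code,
        device_map="xpu",
        torch_dtype=torch.float16,
        quantization_config=woq_quantization_config,
    )
    model = model.to(memory_format=torch.channels_last)

# Optimize the model with Intel Extension for PyTorch (IPEX)
model = ipex.llm.optimize(
    model.eval(),
    device="xpu",
    inplace=True,
    quantization_config=woq_quantization_config,
)
```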
