- I think the main obstacle to a more general-purpose implementation of this is that we don't know which tensors are the "outputs" of the graph.
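  One possible angle, as a minimal sketch only: ggml lets callers mark user-visible tensors with `ggml_set_output()`, which sets `GGML_TENSOR_FLAG_OUTPUT` on the tensor. Assuming outputs are flagged that way, a heuristic could restrict F16 intermediates to non-output nodes. The helpers `can_keep_f16()` and `mark_f16_candidates()` below are hypothetical, not existing API:

  ```c
  // Sketch only: limit F16 intermediates to tensors that are not graph outputs,
  // assuming the caller marks outputs with ggml_set_output().
  #include "ggml.h"

  static bool can_keep_f16(const struct ggml_tensor * t) {
      // tensors read back by the user (logits, embeddings, ...) must stay F32
      return (t->flags & GGML_TENSOR_FLAG_OUTPUT) == 0;
  }

  static void mark_f16_candidates(struct ggml_cgraph * gf) {
      for (int i = 0; i < ggml_graph_n_nodes(gf); ++i) {
          struct ggml_tensor * node = ggml_graph_node(gf, i);
          if (can_keep_f16(node)) {
              // a backend could keep this node's result in F16
          }
      }
  }
  ```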
- Is the proposal (A), "data types based on inference", meaning that the ggml library chooses dst->type (F16 vs. F32) based on some heuristic? I think there's a third option: (C) backends can choose to use F16 as a fusion optimization to remove F16->F32->F16 conversions that occur only because of the F32 storage format.
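  For context on option (C), here is a rough sketch of the round trip being described, assuming `ctx` is a ggml context, `w_f16` an F16 weight, and `x` an activation tensor (all hypothetical names inside graph construction):

  ```c
  // The result of the F16 matmul kernel is stored as F32 because that is the
  // graph's storage type, then converted back to F16 for the next F16 kernel.
  struct ggml_tensor * cur  = ggml_mul_mat(ctx, w_f16, x);        // computed in F16, stored as F32
  struct ggml_tensor * next = ggml_cast(ctx, cur, GGML_TYPE_F16); // explicit F32->F16 copy for the next op
  // A backend-side fusion could recognize this pattern and keep `cur` in F16,
  // removing both the F16->F32 store and the F32->F16 cast.
  ```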
- This discussion is about using FP16 as the data type for intermediate results during graph inference, reducing computation and improving inference speed. Verification was done with the CANN backend on Qwen2.5, Qwen3-MoE, and DeepSeek-Lite-V2, showing performance improvements of 3%–10% depending on the concurrency and model.
  The main changes in the demo are: replacing the hardcoded FP32 data types of the operators involved in the graph with type inference based on their inputs, adding FP16 support for GET_ROWS, and casting t_embd and t_logits back to FP32 at the end of inference (see the sketch below).
  This is only a very basic validation. For full FP16 support, further work is still needed; see #16270 and #16251.
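  As a rough illustration of the idea described above (not the actual patch): instead of hardcoding GGML_TYPE_F32 for destination tensors, the result type could be derived from the inputs, with the user-visible tensors cast back to F32 at the end via `ggml_cast()`. The helper `infer_dst_type()` is hypothetical:

  ```c
  // Sketch only: derive the destination type from the inputs instead of
  // hardcoding GGML_TYPE_F32 (infer_dst_type is a hypothetical helper).
  #include "ggml.h"

  static enum ggml_type infer_dst_type(const struct ggml_tensor * a, const struct ggml_tensor * b) {
      if (a->type == GGML_TYPE_F16 && b->type == GGML_TYPE_F16) {
          return GGML_TYPE_F16;   // keep the intermediate result in F16
      }
      return GGML_TYPE_F32;       // fall back to the current behaviour
  }

  // At the end of the graph, the tensors the user reads back are converted to F32:
  //   t_logits = ggml_cast(ctx, t_logits, GGML_TYPE_F32);
  //   t_embd   = ggml_cast(ctx, t_embd,   GGML_TYPE_F32);
  ```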