- I think the main obstacle to a more general-purpose implementation of this is that we don't know which tensors are the "outputs" of the graph.
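  One possible angle, as a minimal sketch only: ggml lets callers mark user-visible tensors with `ggml_set_output()`, which sets `GGML_TENSOR_FLAG_OUTPUT` on the tensor. Assuming outputs are flagged that way, a heuristic could restrict F16 intermediates to non-output nodes. The helpers `can_keep_f16()` and `mark_f16_candidates()` below are hypothetical, not existing API:

  ```c
  // Sketch only: limit F16 intermediates to tensors that are not graph outputs,
  // assuming the caller marks outputs with ggml_set_output().
  #include "ggml.h"

  static bool can_keep_f16(const struct ggml_tensor * t) {
      // tensors read back by the user (logits, embeddings, ...) must stay F32
      return (t->flags & GGML_TENSOR_FLAG_OUTPUT) == 0;
  }

  static void mark_f16_candidates(struct ggml_cgraph * gf) {
      for (int i = 0; i < ggml_graph_n_nodes(gf); ++i) {
          struct ggml_tensor * node = ggml_graph_node(gf, i);
          if (can_keep_f16(node)) {
              // a backend could keep this node's result in F16
          }
      }
  }
  ```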
- Is the proposal (A), "data types based on inference", meaning that the ggml library chooses dst->type (F16 vs. F32) based on some heuristic? I think there's a third option: (C) backends can choose to use F16 as a fusion optimization to remove F16->F32->F16 conversions that occur only because of the F32 storage format.
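  For context on option (C), here is a rough sketch of the round trip being described, assuming `ctx` is a ggml context, `w_f16` an F16 weight, and `x` an activation tensor (all hypothetical names inside graph construction):

  ```c
  // The result of the F16 matmul kernel is stored as F32 because that is the
  // graph's storage type, then converted back to F16 for the next F16 kernel.
  struct ggml_tensor * cur  = ggml_mul_mat(ctx, w_f16, x);        // computed in F16, stored as F32
  struct ggml_tensor * next = ggml_cast(ctx, cur, GGML_TYPE_F16); // explicit F32->F16 copy for the next op
  // A backend-side fusion could recognize this pattern and keep `cur` in F16,
  // removing both the F16->F32 store and the F32->F16 cast.
  ```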
- This discussion is about using FP16 as the data type for intermediate results during graph inference, reducing computation and improving inference speed. Verification was done with the CANN backend on Qwen2.5, Qwen3-MoE, and DeepSeek-Lite-V2, showing performance improvements of 3%–10% depending on the concurrency and model.
  The main changes in the demo are: replacing the hardcoded FP32 data types of the operators involved in the graph with type inference based on their inputs, adding FP16 support for GET_ROWS, and casting t_embd and t_logits back to FP32 at the end of inference (see the sketch below).
  This is only a very basic validation. For full FP16 support, further work is still needed; see #16270 and #16251.
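  As a rough illustration of the idea described above (not the actual patch): instead of hardcoding GGML_TYPE_F32 for destination tensors, the result type could be derived from the inputs, with the user-visible tensors cast back to F32 at the end via `ggml_cast()`. The helper `infer_dst_type()` is hypothetical:

  ```c
  // Sketch only: derive the destination type from the inputs instead of
  // hardcoding GGML_TYPE_F32 (infer_dst_type is a hypothetical helper).
  #include "ggml.h"

  static enum ggml_type infer_dst_type(const struct ggml_tensor * a, const struct ggml_tensor * b) {
      if (a->type == GGML_TYPE_F16 && b->type == GGML_TYPE_F16) {
          return GGML_TYPE_F16;   // keep the intermediate result in F16
      }
      return GGML_TYPE_F32;       // fall back to the current behaviour
  }

  // At the end of the graph, the tensors the user reads back are converted to F32:
  //   t_logits = ggml_cast(ctx, t_logits, GGML_TYPE_F32);
  //   t_embd   = ggml_cast(ctx, t_embd,   GGML_TYPE_F32);
  ```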