
Conversation

alvarobartt (Member)

What does this PR do?

This PR adds support for the Gemma3 architecture (text only), running in FP32 precision on CPU, MPS, and CUDA devices.

Note

Neither FP16 nor Flash Attention is supported for Gemma3: FP16 causes weight instability that produces NaN values in the hidden states, which in turn rules out Flash Attention.

Additionally, this PR adds support for the ONNX export, whose output is the `sentence_embedding` rather than the `last_hidden_state` from the Transformer layer. Some models contain Transformer + Pooling + Dense modules, so the ONNX export packs all of them together, unlike standard Transformer + Dense ONNX exports that output only the `last_hidden_state` so that pooling can be applied separately afterwards.
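As a minimal sketch of the branching described above (all names hypothetical, not the actual crate API): if the ONNX graph already exposes a `sentence_embedding` output, it is used directly; otherwise the embedding is pooled from `last_hidden_state`.

```rust
use std::collections::HashMap;

// Hypothetical stand-in for a batch of per-token hidden states: one Vec<f32> per token.
type Tensor = Vec<Vec<f32>>;

/// If the ONNX export already packs Transformer + Pooling + Dense, it exposes a
/// "sentence_embedding" output and no further pooling is needed; otherwise fall
/// back to mean-pooling the "last_hidden_state" output ourselves.
fn extract_embedding(outputs: &HashMap<String, Tensor>) -> Option<Vec<f32>> {
    if let Some(sentence) = outputs.get("sentence_embedding") {
        // Already pooled: a single row per sequence.
        return sentence.first().cloned();
    }
    let hidden = outputs.get("last_hidden_state")?;
    let (tokens, dim) = (hidden.len(), hidden.first()?.len());
    // Mean-pool across the token axis.
    let mut pooled = vec![0.0f32; dim];
    for token in hidden {
        for (acc, v) in pooled.iter_mut().zip(token) {
            *acc += *v;
        }
    }
    for acc in pooled.iter_mut() {
        *acc /= tokens as f32;
    }
    Some(pooled)
}

fn main() {
    let mut outputs: HashMap<String, Tensor> = HashMap::new();
    outputs.insert(
        "last_hidden_state".to_string(),
        vec![vec![1.0, 3.0], vec![3.0, 5.0]],
    );
    let pooled = extract_embedding(&outputs).unwrap();
    // Mean over the two token vectors.
    assert_eq!(pooled, vec![2.0, 4.0]);
    println!("{:?}", pooled);
}
```

This is only an illustration of the control flow; the real implementation operates on the crate's tensor types rather than nested `Vec`s.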

Before submitting

  • This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • Did you read the contributor guideline, Pull Request section?
  • Was this discussed/approved via a GitHub issue or the forum? Please add a link to it if that's the case.
  • Did you make sure to update the documentation with your changes? Here are the documentation guidelines, and here are tips on formatting docstrings.
  • Did you write any new necessary tests?

Who can review?

@Narsil

alvarobartt and others added 30 commits July 25, 2025 16:46
kaixuanliu and others added 11 commits August 21, 2025 09:52
Signed-off-by: Liu, Kaixuan <[email protected]>
Some models contain Dense layers after the Transformer + Pooling, which
means the ONNX export may already contain the sentence embeddings, since
the export has to bundle Transformer + Pooling + Dense. This commit adds
`outputs.contains_key("sentence_embedding")` handling to account for that.
Otherwise it would default to `DType::Float16` when built with `--features candle-cuda`.
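The dtype selection that commit fixes can be sketched as follows (hypothetical names and flags, not the crate's actual API): a CUDA build would otherwise default to FP16, but Gemma3 must be pinned to FP32 because half precision produces NaNs in its hidden states.

```rust
// Hypothetical sketch of the dtype-selection logic described above.
#[derive(Debug, PartialEq, Clone, Copy)]
enum DType {
    Float16,
    Float32,
}

/// With `--features candle-cuda` the server would otherwise default to FP16,
/// but Gemma3's weights are unstable in half precision (NaNs in the hidden
/// states), so it is forced to FP32 regardless of build features.
fn resolve_dtype(requested: Option<DType>, cuda_enabled: bool, is_gemma3: bool) -> DType {
    if is_gemma3 {
        return DType::Float32; // FP16 produces NaNs for this architecture
    }
    // Fall back to the build-dependent default when no dtype was requested.
    requested.unwrap_or(if cuda_enabled {
        DType::Float16
    } else {
        DType::Float32
    })
}

fn main() {
    assert_eq!(resolve_dtype(None, true, true), DType::Float32);
    assert_eq!(resolve_dtype(None, true, false), DType::Float16);
    println!("dtype resolution ok");
}
```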
@alvarobartt alvarobartt requested a review from Narsil September 4, 2025 14:32
@alvarobartt alvarobartt mentioned this pull request Sep 4, 2025
@alvarobartt alvarobartt merged commit abf7d42 into main Sep 4, 2025
14 checks passed
@alvarobartt alvarobartt deleted the new-model-addition branch September 4, 2025 15:09
2 participants