
Conversation

@iamlemec (Collaborator) commented Sep 5, 2025

Add support for Qwen3 reranking models. This is largely based on #14029 by @ngxson, with a few tweaks to reflect changes to the codebase in the interim.

This hardcodes the chat template provided in the README.md, which I'm assuming is the intended usage. If folks want to be able to change that, we'd need a new CLI option. The template uses string substitution rather than jinja, since jinja appears to be used only for chat messages.
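For reference, here is a minimal sketch of the kind of string substitution involved, assuming the prompt format from the Qwen3-Reranker README. This is illustrative only, not the PR's actual code; the exact system/instruct wording may differ slightly.

```cpp
#include <string>

// Hypothetical sketch (not the PR's actual code): build the reranker prompt by
// plain string substitution instead of a jinja template. The wording follows
// the Qwen3-Reranker README as best I recall it.
static std::string qwen3_rerank_prompt(const std::string & query, const std::string & doc) {
    return
        "<|im_start|>system\n"
        "Judge whether the Document meets the requirements based on the Query "
        "and the Instruct provided. Note that the answer can only be \"yes\" or \"no\".<|im_end|>\n"
        "<|im_start|>user\n"
        "<Instruct>: Given a web search query, retrieve relevant passages that answer the query\n"
        "<Query>: " + query + "\n<Document>: " + doc + "<|im_end|>\n"
        "<|im_start|>assistant\n"
        "<think>\n\n</think>\n\n";
}
```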

Edit: Here's an example usage similar to that used in the official Qwen repo. Note that \t separates queries from documents and \n separates different prompts.

build/bin/llama-embedding -m qwen3-reranker-0.6b-f32.gguf --embd-normalize -1 -p "What is the capital of China?\tThe capital of China is Beijing.\nExplain gravity\tGravity is a force that attracts two bodies towards each other."

Notice that we need to pass --embd-normalize -1 to disable normalization (the default is L2 norm).
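As a hypothetical post-processing step (not something this PR does): if you treat the raw score as a yes/no logit, a logistic sigmoid maps it to a relevance probability in [0, 1]. A minimal sketch:

```cpp
#include <cmath>
#include <cstdio>

// Assumption: the raw score printed by llama-embedding is a yes/no logit;
// the sigmoid then gives a probability-like relevance score.
static double score_to_prob(double raw_score) {
    return 1.0 / (1.0 + std::exp(-raw_score)); // logistic sigmoid
}

int main() {
    std::printf("%f\n", score_to_prob(2.3)); // prints ~0.909
    return 0;
}
```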

@iamlemec iamlemec requested a review from ngxson as a code owner September 5, 2025 21:21
@github-actions github-actions bot added the examples, python (python script changes), and server labels Sep 5, 2025
@ggerganov (Member) commented

Great - will take a look tomorrow. Would be useful to add a basic usage example in the OP of this PR.

@ggerganov ggerganov self-requested a review September 7, 2025 18:35
@iamlemec (Collaborator, Author) commented Sep 8, 2025

Yup, will add a usage example up above. I'm actually seeing some numerical differences between the output here and the transformers numbers, so you might want to hold off on review/merge until I can sort this out. (Or if you spot something amiss in the code, that would be great too.)

@ggerganov ggerganov marked this pull request as draft September 10, 2025 11:52
@iamlemec (Collaborator, Author) commented
Ok, finally fixed it! Now we have numerical parity with the HF implementation. It turned out to be a small difference in the chat template. Should be ready for review @ggerganov.

@iamlemec iamlemec marked this pull request as ready for review September 18, 2025 20:24
-    bool last = cparams.pooling_type == LLAMA_POOLING_TYPE_LAST;
+    const bool last = (
+        cparams.pooling_type == LLAMA_POOLING_TYPE_LAST ||
+        (cparams.pooling_type == LLAMA_POOLING_TYPE_RANK && arch == LLM_ARCH_QWEN3) // qwen3 reranking & embedding models use last token
+    );
Review comment (Member):
I am wondering if it makes sense to remove pooling type RANK altogether from libllama. Do you have any thoughts on whether having a separate pooling class RANK is really necessary?

Reply (Collaborator, Author):

I think you could get really close to merging RANK with LAST. The main differentiator is in llm_graph_context::build_pooling, where you apply cls_out to map from the last token of the last hidden state to the classification output (usually yes/no). Unlike the other pooling types, you actually need knowledge of the model weights to do the calculation.
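For context, here is a rough paraphrase (from memory, not the exact llama.cpp source) of what the RANK path has to do beyond LAST pooling: after selecting the pooled token's hidden state, it projects it through the classification head tensors, which is the step that needs model weights.

```cpp
#include "ggml.h"

// Paraphrased sketch of the RANK pooling step (tensor names are illustrative,
// not the exact llama.cpp source):
static ggml_tensor * build_rank_pooling(
        ggml_context * ctx0,
        ggml_tensor  * inp,       // [n_embd, n_tokens] hidden states
        ggml_tensor  * inp_cls,   // index of the pooled token (for Qwen3: the last token)
        ggml_tensor  * cls_out,   // classification head weight (a model tensor)
        ggml_tensor  * cls_out_b  // optional classification head bias
) {
    // same as LAST pooling: select the hidden state of the chosen token
    ggml_tensor * cur = ggml_get_rows(ctx0, inp, inp_cls);

    // the RANK-specific part: map the hidden state to a scalar relevance score;
    // this is what requires knowledge of the model weights (cls_out)
    cur = ggml_mul_mat(ctx0, cls_out, cur);
    if (cls_out_b) {
        cur = ggml_add(ctx0, cur, cls_out_b);
    }
    return cur;
}
```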

@ggerganov ggerganov merged commit b5bd037 into ggml-org:master Sep 25, 2025
56 of 59 checks passed
pwilkin pushed a commit to pwilkin/llama.cpp that referenced this pull request Sep 25, 2025
struct pushed a commit to struct/llama.cpp that referenced this pull request Sep 26, 2025