Conversation

@rolshoven (Contributor) commented on Sep 16, 2025

Background

See Issue #966:

Currently, there are a few things that are not exposed through the LiteLLMModelConfig that would be very useful when running evaluations:

  • It would be nice to have a verbose flag for when you need to debug something related to litellm.
  • If you know the maximum context length of your model, it would be nice to set it explicitly instead of relying on the default of 4096 that is currently hardcoded in the max_length property.
  • APIs differ in robustness and rate limits, so it would be nice to configure the number of retries performed when calling the API, the waiting time between requests, and optionally a timeout for requests that take too long.

Additionally, it would be nice to apply the current strategy for the o1 model in _prepare_max_new_tokens to other reasoning models as well.

Changes in this PR

This PR introduces the following new options in the LiteLLMModelConfig:

"""
(...)
verbose (bool):
    Whether to enable verbose logging. Default is False.
max_model_length (int | None):
    Maximum context length for the model. If None, infers the model's default max length.
api_max_retry (int):
    Maximum number of retries for API requests. Default is 8.
api_retry_sleep (float):
    Initial sleep time (in seconds) between retries. Default is 1.0.
api_retry_multiplier (float):
    Multiplier for increasing sleep time between retries. Default is 2.0.
timeout (float | None):
    Request timeout in seconds. Default is None (no timeout).
(...)
"""

The increase in the allowed number of tokens (see _prepare_max_new_tokens) is now applied to all models that litellm recognizes as reasoning models (via its supports_reasoning function). Instead of hardcoded upper bounds, we use litellm's get_max_tokens helper or, if that fails, query the maximum context length from the OpenRouter endpoints. If the specified provider is present in that list, we take the value reported by OpenRouter directly; otherwise, we choose the minimum context length among all OpenRouter providers so that the value works with every provider listed there. If this also fails, we fall back to the default context length of 4096, the same value that was previously hardcoded.
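
A rough sketch of that fallback chain is shown below. This is not the PR's actual code: the OpenRouter URL and its response fields (data, context_length) are assumptions based on OpenRouter's public models endpoint, the per-provider lookup is simplified away, and error handling is reduced to the bare minimum.

import litellm
import requests

DEFAULT_MAX_LENGTH = 4096  # same fallback as the previously hardcoded value


def infer_max_tokens(model: str) -> int:
    """Best-effort lookup of a model's context length (illustrative only)."""
    # 1. Ask litellm directly.
    try:
        max_tokens = litellm.get_max_tokens(model)
        if max_tokens:
            return max_tokens
    except Exception:
        pass

    # 2. Fall back to OpenRouter's public model list (assumed response schema).
    try:
        data = requests.get("https://openrouter.ai/api/v1/models", timeout=10).json()["data"]
        lengths = [m["context_length"] for m in data if m.get("context_length")]
        if lengths:
            # Use the smallest context length so the value works with every listed provider.
            return min(lengths)
    except Exception:
        pass

    # 3. Give up and use the old default.
    return DEFAULT_MAX_LENGTH

The reasoning gate itself is simply a call to litellm.supports_reasoning(model) before deciding whether to raise the token budget, which is why the minimum litellm version had to be bumped (see below).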

In order to use litellm's supports_reasoning function, I had to bump the minimum required litellm version in pyproject.toml to 1.66.0.

@HuggingFaceDocBuilderDev (Collaborator) commented:

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

response = litellm.completion(**kwargs)
content = response.choices[0].message.content

if content and "<think>" in content:
Member commented:

this is handled by the remove thinking tag option in the cli

@rolshoven (Contributor, Author) commented:

Oh, I see! Then I guess this happens outside the model classes. Should I just remove that from litellm_model.py then?

Member commented:

yeah! you can also make sure that when you use --remove_thinking_tags it works as expected :)

@rolshoven (Contributor, Author) commented:

I removed it. However, I was unable to reproduce the case where the reasoning traces appear in the model's output, because the reasoning is actually saved under the reasonings attribute of the ModelResponse, as defined on lines 365-374 here.

I did, however, verify (using a breakpoint in my debugging config) that remove_reasoning_tags is executed as part of _post_process_outputs in the Pipeline (--remove-reasoning-tags is set to True by default). So I think it is safe to remove the code you mentioned and to assume that the stripping of reasoning content will work if, at some point, there is actual reasoning content in the text attribute of ModelResponse.
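
For reference, the tag stripping discussed above boils down to something like the sketch below. This is a hand-written illustration, not lighteval's remove_reasoning_tags implementation; the <think>/</think> pair is taken from the snippet under review.

import re


def strip_reasoning_tags(text: str, tag_pairs=(("<think>", "</think>"),)) -> str:
    """Remove reasoning blocks such as <think>...</think> from model output."""
    for start, end in tag_pairs:
        # re.DOTALL lets the pattern match reasoning blocks that span multiple lines.
        text = re.sub(re.escape(start) + r".*?" + re.escape(end), "", text, flags=re.DOTALL)
    return text.strip()


print(strip_reasoning_tags("<think>internal chain of thought</think>The answer is 42."))
# -> "The answer is 42."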

@NathanHB (Member) left a comment:

looks good! only a few questions and good to merge

@rolshoven force-pushed the litellm_model_changes branch from 1f36913 to 3b6101d on September 26, 2025 at 15:57