
Conversation


@gadenbuie commented Aug 27, 2025

This PR adds support for tool results that return images or PDFs.

This isn't a feature that's widely supported in provider APIs, but we work around the limitation by moving image and PDF content out of the tool result and into the abstract user turn that carries the tool results.

We support two cases:

  • Directly returning a content_image() or content_pdf() as a tool result.
  • Returning a list that contains these content types at most one level deep.

In both cases, we replace the value in the tool result with "[see below]" (or "[see below: item N]" in the list case) and wrap the extra content in <content tool-call-id="abc123" item="N">...content...</content> XML tags.
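
For example, a tool could return either shape like this (a minimal sketch; the function names and file paths are made up for illustration):

# A minimal sketch of the two supported return shapes; get_cover_image(),
# get_gallery(), and the file paths are hypothetical.
get_cover_image <- function() {
  # Case 1: return a single content object directly
  content_image_file("cover.jpg")
}

get_gallery <- function() {
  # Case 2: return a list of content objects, nested at most one level deep
  list(
    content_image_file("cat1.jpg"),
    content_image_file("cat2.jpg")
  )
}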

Notes

  • OpenAI requires that tool results arrive in a separate message that follows the assistant message, and this appears to be common among providers that keep tool results separate. I checked all as_json() methods for Turn and updated them to return tool_message, user_message (see the sketch after these notes).
  • tool_string() doesn't support having these content types in the tool result because it calls jsonlite::toJSON(). I updated this function so that internally we can force the JSON conversion for printing, but require it to succeed for the actual tool results that we send across the wire; if it fails, it now fails with a more informative error message. (Internally we call this function when echoing the tool result, before we've pulled out the content types; a rough sketch of the pattern follows below.)
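
To make the message-splitting note concrete, here's roughly the pair of messages produced for OpenAI once the image has been moved out of the tool result. This is illustrative only: the role and tool_call_id fields follow the OpenAI chat completions format, but exactly how ellmer splits the text and image parts is an assumption.

# Illustrative only: roughly the tool_message, user_message pair sent to
# OpenAI after the image has been moved out of the tool result.
list(
  list(
    role = "tool",
    tool_call_id = "call_abc123",
    content = "[see below]"
  ),
  list(
    role = "user",
    content = list(
      list(type = "text", text = '<content tool-call-id="call_abc123">'),
      list(
        type = "image_url",
        image_url = list(url = "data:image/jpeg;base64,<...>")
      ),
      list(type = "text", text = "</content>")
    )
  )
)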
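
And a hypothetical sketch of the toJSON() guard described in the second note; the function name and the require_json argument are invented, and the real tool_string() internals differ.

# Hypothetical sketch only, not the actual tool_string() implementation.
tool_result_json <- function(value, require_json = TRUE) {
  tryCatch(
    jsonlite::toJSON(value, auto_unbox = TRUE),
    error = function(err) {
      if (!require_json) {
        # When we're just echoing the result, fall back to a placeholder
        return("<tool result could not be converted to JSON>")
      }
      # For results sent across the wire, fail with an informative error
      cli::cli_abort("Tool results must be convertible to JSON.", parent = err)
    }
  )
}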

Example

pkgload::load_all()
#> ℹ Loading ellmer

get_cat_image <- function() {
  size <- sample(200:300, 1)
  url <- sprintf("https://placecats.com/%d/%d", size, size)

  tmpf <- withr::local_tempfile(fileext = ".jpg")
  download.file(url, tmpf, quiet = TRUE)

  content_image_file(tmpf, resize = "none")
}

chat <- chat("openai/gpt-5-nano", echo = "none")
# chat <- chat("anthropic")
# chat <- chat("google_gemini")
# chat <- chat_deepseek(echo = "output")
# There aren't many tool+vision Ollama models, but this one should work (though not on my M1)
# chat <- chat_ollama(model = "mistral-small3.2", echo = "output")
chat$register_tool(
  tool(
    function(n_images = 1) {
      if (n_images == 1) {
        get_cat_image()
      } else {
        lapply(seq_len(n_images), function(i) get_cat_image())
      }
    },
    name = "get_cat_image",
    description = "Gets a random cat image.",
    arguments = list(
      n_images = type_integer("Number of cat images to get at once.")
    )
  )
)

. <- chat$chat(
  "Get a random cat image and describe what the cat is feeling."
)
. <- chat$chat(
  "Get 2 random cat images and describe what the cats are feeling."
)
chat
#> <Chat OpenAI/gpt-5-nano turns=8 tokens=1826/1942 $0.00>
#> ── user [149] ──────────────────────────────────────────────────────────────────
#> Get a random cat image and describe what the cat is feeling.
#> ── assistant [281] ─────────────────────────────────────────────────────────────
#> [tool request (call_jjoIvBbPW336sG0FWh6U9b5U)]: get_cat_image(n_images = 1L)
#> ── user [-62] ──────────────────────────────────────────────────────────────────
#> [tool result  (call_jjoIvBbPW336sG0FWh6U9b5U)]: [see below]
#> <content tool-call-id="call_jjoIvBbPW336sG0FWh6U9b5U">
#> [inline image]
#> </content>
#> ── assistant [624] ─────────────────────────────────────────────────────────────
#> The cat looks curious and attentive, perhaps a touch cautious. Reasons:
#> - Ears are forward and upright, signaling interest.
#> - Wide, focused eyes suggest it’s watching or evaluating something.
#> - Whiskers are forward, which often happens when a cat is exploring or concentrating.
#> - Body is upright and alert, not relaxed or scared.
#> 
#> In short: curious, observant, and a bit cautious about its surroundings. If you’d like, I can give you a few short captions to pair with the image.
#> ── user [-497] ─────────────────────────────────────────────────────────────────
#> Get 2 random cat images and describe what the cats are feeling.
#> ── assistant [346] ─────────────────────────────────────────────────────────────
#> [tool request (call_M2b8yTCopQZWj0HyA5zVT0d1)]: get_cat_image(n_images = 2L)
#> ── user [-27] ──────────────────────────────────────────────────────────────────
#> [tool result  (call_M2b8yTCopQZWj0HyA5zVT0d1)]: ["[see below: item 1]","[see below: item 2]"]
#> <content tool-call-id="call_M2b8yTCopQZWj0HyA5zVT0d1" item="1">
#> [inline image]
#> </content>
#> <content tool-call-id="call_M2b8yTCopQZWj0HyA5zVT0d1" item="2">
#> [inline image]
#> </content>
#> ── assistant [691] ─────────────────────────────────────────────────────────────
#> Here are feel descriptions for the two images:
#> 
#> - Item 1:
#>   - Left cat: confident and curious. Ears forward, eyes open and focused, relaxed posture.
#>   - Right cat: content and sleepy. Eyes closed, resting head/face on paws, relaxed body.
#> 
#> - Item 2:
#>   - Orange cat: playful and curious. Body lowered, eyes toward the green toy, ears forward, paw/face near the toy, engaged in play or exploration.
#> 
#> Want me to suggest short captions for each image?
chat$get_turns()[[3]] |> contents_markdown() |> knitr::asis_output()
[rendered image: cat photo]
chat$get_turns()[[7]] |> contents_markdown() |> knitr::asis_output()
[rendered images: two cat photos]

Moving these content types out of the tool result and into the abstract user turn better links the content to its source, while generally hiding the markup from user view (shinychat doesn't show the XML tags in assistant output).
@gadenbuie gadenbuie marked this pull request as ready for review August 27, 2025 21:01
@gadenbuie gadenbuie requested a review from hadley August 27, 2025 21:02