
Conversation


@gadenbuie commented Aug 27, 2025

This PR adds support for tool results that return images or PDFs.

This isn't a feature that's widely supported in provider APIs, but we work around the limitation by moving image and PDF content out of the tool result and into the abstract user turn that carries the tool results.

We support two cases:

  • Directly returning a content_image() or content_pdf() as a tool result.
  • Returning a list that contains these content types at most one level deep.

In both cases, we replace the value in the tool result with "[see below]" (or "[see below: item N]" in the list case) and wrap the extra content in <content tool-call-id="abc123" item="N">...content...</content> XML tags.
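
For example, a tool could return either shape like this (a minimal sketch; the function names and file paths are made up for illustration):

# A minimal sketch of the two supported return shapes; get_cover_image(),
# get_gallery(), and the file paths are hypothetical.
get_cover_image <- function() {
  # Case 1: return a single content object directly
  content_image_file("cover.jpg")
}

get_gallery <- function() {
  # Case 2: return a list of content objects, nested at most one level deep
  list(
    content_image_file("cat1.jpg"),
    content_image_file("cat2.jpg")
  )
}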

Notes

  • OpenAI requires that tool results arrive in a separate message that follows the assistant message, and this appears to be common among providers that keep tool results separate. I checked all as_json() methods for Turn and updated them to return tool_message, user_message (see the sketch after these notes).
  • tool_string() doesn't support having these content types in the tool result because it calls jsonlite::toJSON(). I updated this function so that internally we can force the JSON conversion for printing, but require it to succeed for the actual tool results that we send across the wire; if it fails, it now fails with a more informative error message. (Internally we call this function when echoing the tool result, before we've pulled out the content types; a rough sketch of the pattern follows below.)
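
To make the message-splitting note concrete, here's roughly the pair of messages produced for OpenAI once the image has been moved out of the tool result. This is illustrative only: the role and tool_call_id fields follow the OpenAI chat completions format, but exactly how ellmer splits the text and image parts is an assumption.

# Illustrative only: roughly the tool_message, user_message pair sent to
# OpenAI after the image has been moved out of the tool result.
list(
  list(
    role = "tool",
    tool_call_id = "call_abc123",
    content = "[see below]"
  ),
  list(
    role = "user",
    content = list(
      list(type = "text", text = '<content tool-call-id="call_abc123">'),
      list(
        type = "image_url",
        image_url = list(url = "data:image/jpeg;base64,<...>")
      ),
      list(type = "text", text = "</content>")
    )
  )
)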
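
And a hypothetical sketch of the toJSON() guard described in the second note; the function name and the require_json argument are invented, and the real tool_string() internals differ.

# Hypothetical sketch only, not the actual tool_string() implementation.
tool_result_json <- function(value, require_json = TRUE) {
  tryCatch(
    jsonlite::toJSON(value, auto_unbox = TRUE),
    error = function(err) {
      if (!require_json) {
        # When we're just echoing the result, fall back to a placeholder
        return("<tool result could not be converted to JSON>")
      }
      # For results sent across the wire, fail with an informative error
      cli::cli_abort("Tool results must be convertible to JSON.", parent = err)
    }
  )
}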

Example

pkgload::load_all()
#> ℹ Loading ellmer

get_cat_image <- function() {
  size <- sample(200:300, 1)
  url <- sprintf("https://placecats.com/%d/%d", size, size)

  tmpf <- withr::local_tempfile(fileext = ".jpg")
  download.file(url, tmpf, quiet = TRUE)

  content_image_file(tmpf, resize = "none")
}

chat <- chat("openai/gpt-5-nano", echo = "none")
# chat <- chat("anthropic")
# chat <- chat("google_gemini")
# chat <- chat_deepseek(echo = "output")
# There aren't many tool+vision Ollama models, but this one should work (though not on my M1)
# chat <- chat_ollama(model = "mistral-small3.2", echo = "output")
chat$register_tool(
  tool(
    function(n_images = 1) {
      if (n_images == 1) {
        get_cat_image()
      } else {
        lapply(seq_len(n_images), function(i) get_cat_image())
      }
    },
    name = "get_cat_image",
    description = "Gets a random cat image.",
    arguments = list(
      n_images = type_integer("Number of cat images to get at once.")
    )
  )
)

. <- chat$chat(
  "Get a random cat image and describe what the cat is feeling."
)
. <- chat$chat(
  "Get 2 random cat images and describe what the cats are feeling."
)
chat
#> <Chat OpenAI/gpt-5-nano turns=8 tokens=1826/1942 $0.00>
#> ── user [149] ──────────────────────────────────────────────────────────────────
#> Get a random cat image and describe what the cat is feeling.
#> ── assistant [281] ─────────────────────────────────────────────────────────────
#> [tool request (call_jjoIvBbPW336sG0FWh6U9b5U)]: get_cat_image(n_images = 1L)
#> ── user [-62] ──────────────────────────────────────────────────────────────────
#> [tool result  (call_jjoIvBbPW336sG0FWh6U9b5U)]: [see below]
#> <content tool-call-id="call_jjoIvBbPW336sG0FWh6U9b5U">
#> [inline image]
#> </content>
#> ── assistant [624] ─────────────────────────────────────────────────────────────
#> The cat looks curious and attentive, perhaps a touch cautious. Reasons:
#> - Ears are forward and upright, signaling interest.
#> - Wide, focused eyes suggest it’s watching or evaluating something.
#> - Whiskers are forward, which often happens when a cat is exploring or concentrating.
#> - Body is upright and alert, not relaxed or scared.
#> 
#> In short: curious, observant, and a bit cautious about its surroundings. If you’d like, I can give you a few short captions to pair with the image.
#> ── user [-497] ─────────────────────────────────────────────────────────────────
#> Get 2 random cat images and describe what the cats are feeling.
#> ── assistant [346] ─────────────────────────────────────────────────────────────
#> [tool request (call_M2b8yTCopQZWj0HyA5zVT0d1)]: get_cat_image(n_images = 2L)
#> ── user [-27] ──────────────────────────────────────────────────────────────────
#> [tool result  (call_M2b8yTCopQZWj0HyA5zVT0d1)]: ["[see below: item 1]","[see below: item 2]"]
#> <content tool-call-id="call_M2b8yTCopQZWj0HyA5zVT0d1" item="1">
#> [inline image]
#> </content>
#> <content tool-call-id="call_M2b8yTCopQZWj0HyA5zVT0d1" item="2">
#> [inline image]
#> </content>
#> ── assistant [691] ─────────────────────────────────────────────────────────────
#> Here are feel descriptions for the two images:
#> 
#> - Item 1:
#>   - Left cat: confident and curious. Ears forward, eyes open and focused, relaxed posture.
#>   - Right cat: content and sleepy. Eyes closed, resting head/face on paws, relaxed body.
#> 
#> - Item 2:
#>   - Orange cat: playful and curious. Body lowered, eyes toward the green toy, ears forward, paw/face near the toy, engaged in play or exploration.
#> 
#> Want me to suggest short captions for each image?
chat$get_turns()[[3]] |> contents_markdown() |> knitr::asis_output()
[rendered image: cat photo]
chat$get_turns()[[7]] |> contents_markdown() |> knitr::asis_output()
[rendered images: two cat photos]

Moving these content types out of the tool result and into the abstract user turn better links the content to its source, while generally hiding the markup from user view (shinychat doesn't show the XML tags in assistant output).
@gadenbuie gadenbuie marked this pull request as ready for review August 27, 2025 21:01
@gadenbuie gadenbuie requested a review from hadley August 27, 2025 21:02