📽 Multi image support for GRPO/RLOO #4113
Conversation
…_thw` in GRPO and RLOO trainers; update `split_pixel_values_by_grid` to use `image_grid_thw`
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.
)
trainer = GRPOTrainer(
    model=model_id,
    reward_funcs="trl-internal-testing/tiny-Qwen2ForSequenceClassification-2.5",
We don't support visual reward models, so it doesn't really make sense to test this case, where the image is dropped and a warning is raised.
trl/trainer/grpo_trainer.py
Outdated
# VLM reward models aren't supported yet, so we drop the image and raise a warning if needed
for prompt in prompts:
    for turn in prompt:
        if isinstance(turn["content"], list):
            logger.warning_once("Visual reward models aren't supported yet; dropping image.")
            turn["content"] = " ".join(e["text"] for e in turn["content"] if e["type"] == "text")
The change goes from

[{"role": "user", "content": [{"type": "image"}, {"type": "text", "text": "What color is the sky?"}]}]

to

[{"role": "user", "content": "What color is the sky?"}]

and raises a warning.
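
As a standalone illustration, here is a minimal sketch of that flattening (the input is a made-up example, and `warnings.warn` stands in for the trainer's `logger.warning_once`):

import warnings

prompts = [
    [{"role": "user", "content": [{"type": "image"}, {"type": "text", "text": "What color is the sky?"}]}]
]

for prompt in prompts:
    for turn in prompt:
        if isinstance(turn["content"], list):
            # Keep only the text entries and join them into a plain string
            warnings.warn("Visual reward models aren't supported yet; dropping image.")
            turn["content"] = " ".join(e["text"] for e in turn["content"] if e["type"] == "text")

print(prompts)  # [[{'role': 'user', 'content': 'What color is the sky?'}]]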
# We don't yet support visual reward models/functions, so we keep a copy of the original text-only prompts for
# later use in the reward computation. If images are present, we insert {"type": "image"} as required by the
# VLM chat template.
original_prompts = copy.deepcopy(prompts)
Instead of keeping the original prompt, we just drop the image later and raise a warning; see https://github.com/huggingface/trl/pull/4113/files#r2364899902
# important because rewards will be normalized per group, and completions are distributed. We will later slice
# rewards_per_func to extract each process's subset.
- rewards_per_func = self._calculate_rewards(inputs, original_prompts, completions, completion_ids_list)
+ rewards_per_func = self._calculate_rewards(inputs, prompts, completions, completion_ids_list)
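
To illustrate the slicing mentioned in the comment, a toy sketch (the process count and batch size are made up; in the trainer the tensor is gathered across processes first):

import torch

num_processes, local_batch = 2, 4
# Pretend this was gathered from all processes: one reward column, 8 rows total
rewards_per_func = torch.arange(num_processes * local_batch, dtype=torch.float32).reshape(-1, 1)

process_index = 1
local_slice = rewards_per_func[process_index * local_batch : (process_index + 1) * local_batch]
print(local_slice.squeeze(-1))  # tensor([4., 5., 6., 7.])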
if self._logs["images"]:
    table["images"] = []
    for image_list in self._logs["images"]:
        # Convert images to wandb Image objects for proper visualization
        table["images"].append([wandb.Image(image) for image in image_list])
boundaries = [0, *accumulate(batch["num_images"])]  # [3, 4, 5] -> [0, 3, 7, 12]
sections = [sum(lengths[boundaries[i] : boundaries[i + 1]]) for i in range(len(batch["num_images"]))]
split_values = list(torch.split(batch["pixel_values"], sections, dim=0))
image_grid_thw = list(torch.split(batch["image_grid_thw"], batch["num_images"], dim=0))
return {**batch, "pixel_values": split_values, "image_grid_thw": image_grid_thw}
Instead of keeping image_grid_thw as is, we need to split it depending on the number of images per example. It gets concatenated later in _get_per_token_logps_and_entropies (see line 807).
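
A toy illustration of that splitting, with assumed shapes (three examples containing 3, 4, and 5 images; each row of image_grid_thw is one image's (t, h, w) grid):

from itertools import accumulate

import torch

num_images = [3, 4, 5]
image_grid_thw = torch.ones(sum(num_images), 3, dtype=torch.long)  # one (t, h, w) row per image

boundaries = [0, *accumulate(num_images)]  # [0, 3, 7, 12]
per_example = list(torch.split(image_grid_thw, num_images, dim=0))
print([t.shape for t in per_example])  # [torch.Size([3, 3]), torch.Size([4, 3]), torch.Size([5, 3])]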
trl/trainer/grpo_trainer.py
Outdated
model_inputs["image_grid_thw"] = torch.cat(image_grid_thw[start : start + batch_size])
start_pixel_idx = 0 if start == 0 else torch.cat(image_grid_thw[:start]).prod(-1).sum().item()
end_pixel_idx = torch.cat(image_grid_thw[: start + batch_size]).prod(-1).sum().item()
See https://github.com/huggingface/trl/pull/4113/files#r2364904060: image_grid_thw is no longer a single tensor but a list of tensors, so it must be concatenated before computing the slice indices.
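
The prod(-1).sum() indexing works because each image contributes t*h*w rows to pixel_values; a small sketch with made-up grids:

import torch

# One tensor per example; each row is an image's (t, h, w) grid
image_grid_thw = [
    torch.tensor([[1, 2, 2]]),             # example 0: one image -> 4 rows of pixel_values
    torch.tensor([[1, 2, 2], [1, 4, 2]]),  # example 1: two images -> 4 + 8 rows
]
start, batch_size = 1, 1

start_pixel_idx = 0 if start == 0 else torch.cat(image_grid_thw[:start]).prod(-1).sum().item()
end_pixel_idx = torch.cat(image_grid_thw[: start + batch_size]).prod(-1).sum().item()
print(start_pixel_idx, end_pixel_idx)  # 4 16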
LGTM, with a question about whether raising an error or a warning is best when images + text are being passed to the reward function.
self.assertIsNotNone(trainer.state.log_history[-1]["train_loss"])

for n, param in previous_trainable_params.items():
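
The loop body is cut off in this view; the usual pattern in trl's trainer tests continues roughly as follows (a reconstruction, not verbatim from this PR):

for n, param in previous_trainable_params.items():
    new_param = trainer.model.get_parameter(n)
    # Training should have updated every trainable parameter
    self.assertFalse(torch.equal(param, new_param), f"Parameter {n} has not changed")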
Does the same comment for GRPO apply here? https://github.com/huggingface/trl/pull/4113/files#diff-96dca172e696190fc3e1469166e88aface95ebae959284c6806f2e25d2217c16R1587
Answered here: #4113 (comment)
trl/trainer/grpo_trainer.py
Outdated
for prompt in prompts:
    for turn in prompt:
        if isinstance(turn["content"], list):
            logger.warning_once("Visual reward models aren't supported yet; dropping image.")
Would raising an error be better than a warning? Otherwise I could imagine the warning being missed and the training "failing silently" because the reward is only computed on the text part.
Yes, I see. I wonder if anyone would want to train a VLM with a standard LM reward model (i.e., not a visual reward model); so far, I've never seen that. We could always support it in the future if there is demand. I'll remove this warning: if the user tries it, the rendering of the chat template will fail, which prevents the silent-failure scenario you describe.
table["images"] = [] | ||
for image_list in self._logs["images"]: | ||
# Convert images to wandb Image objects for proper visualization | ||
table["images"].append([wandb.Image(image) for image in image_list]) |
At some point it would be nice to also add the trackio variant for table images.
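
Since trackio advertises a wandb-compatible API, that variant might look like the sketch below; whether the installed trackio version exposes an Image wrapper with this signature is an assumption to verify:

import trackio

if self._logs["images"]:
    table["images"] = []
    for image_list in self._logs["images"]:
        # Assumption: trackio mirrors wandb's Image wrapper for table cells
        table["images"].append([trackio.Image(image) for image in image_list])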
This PR belongs to a sequence of PRs that aim to refactor the generation part of GRPO/RLOO to allow for easier customization and, ultimately, tool calling.

Previous:
- `image_split_sizes` in favour of `image_grid_thw` #4111

Next:
- `_generate` #4114
- `_generate` in GRPO/RLOO: list of ints instead of tensors #4146
- `_generate` in GRPO/RLOO: Use `prompt_ids` from generation #4152
- `_generate` in GRPO/RLOO: Rely on generator for prompt truncation #4153
- `_generate` in GRPO/RLOO: Move `forward_kwargs` outside generation method #4154
- `_generate` in GRPO/RLOO: Insert images in the prompt #4155

While refactoring, I realized that clean multi-image support helps maintain a cleaner separation between functions.