Description
Outline & Motivation
While optimizing the performance of some important DL workloads, we found that PyTorch Lightning can introduce redundant host & device synchronizations. These syncs hurt performance and block the use of CUDA graphs.
Progress Bar Metric
As shown in logger_connector/result.py#L491-L493:
# populate progress_bar metrics. convert tensors to numbers
if result_metric.meta.prog_bar:
    metrics["pbar"][forked_name] = convert_tensors_to_scalars(value)
If a metric is logged with prog_bar=True, e.g. pl_module.log('lr', lr, prog_bar=True), the metric tensor is always converted to a scalar, regardless of whether the user enables the progress bar, and this conversion introduces a host & device sync.
It's better to skip the conversion when the user doesn't want a progress bar, i.e. when trainer.enable_progress_bar = False, so that these synchronizations are avoided.
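Below is a minimal, self-contained sketch of the idea. It is not Lightning's actual connector code: collect_pbar_metrics is a hypothetical helper standing in for the logic in logger_connector/result.py, and the enable_progress_bar flag is assumed to be passed in from the trainer.

import torch

def collect_pbar_metrics(logged: dict, enable_progress_bar: bool) -> dict:
    # When the progress bar is disabled, skip the tensor-to-scalar conversion
    # entirely, so no device-to-host copy (and no GPU sync) is triggered.
    if not enable_progress_bar:
        return {}
    # .item() copies each value to the host, which synchronizes with the GPU.
    return {
        name: value.item() if isinstance(value, torch.Tensor) else value
        for name, value in logged.items()
    }

device = "cuda" if torch.cuda.is_available() else "cpu"
logged = {"lr": torch.tensor(1e-3, device=device)}
print(collect_pbar_metrics(logged, enable_progress_bar=False))  # {} -> no sync
print(collect_pbar_metrics(logged, enable_progress_bar=True))   # converts, syncs on CUDA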
Best Metric Device
The metric tensor is always placed on the GPU when training with a CUDA device, so every retrieval of the metric triggers a device-to-host synchronization.
Instead, I'd propose putting the metric on the same device as the value/tensor that updates it. For example:
- The user is likely to update the global_step metric with a scalar on the CPU, so the metric for global_step should live on the CPU;
- The user is likely to update the loss metric with a tensor on the GPU, so the metric for loss should live on the GPU.
In this way, updating the global_step metric no longer introduces a host & device synchronization, since the metric lives on the CPU instead of the GPU. If the user later retrieves the global_step metric, there is no sync either.
The original logic is at core/module.py#L657-L661:
value = (
    value.clone().detach()
    if isinstance(value, Tensor)
    else torch.tensor(value, device=self.device, dtype=_get_default_dtype())
)
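Here is a sketch of the proposed change as a hypothetical standalone helper (to_metric_tensor is not Lightning's API, and torch.get_default_dtype() stands in for _get_default_dtype()). The only behavioural difference from the snippet above is that plain numbers are no longer moved onto self.device.

import torch
from torch import Tensor

def to_metric_tensor(value):
    # Tensors keep the device they were logged on: a GPU loss stays on the
    # GPU, a CPU tensor stays on the CPU.
    if isinstance(value, Tensor):
        return value.clone().detach()
    # Plain Python numbers (e.g. global_step) become CPU tensors, so later
    # updates and reads never cross the host/device boundary.
    return torch.tensor(value, dtype=torch.get_default_dtype())

step_metric = to_metric_tensor(123)  # lives on the CPU
loss_device = "cuda" if torch.cuda.is_available() else "cpu"
loss_metric = to_metric_tensor(torch.ones((), device=loss_device))
print(step_metric.device, loss_metric.device)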
Pitch
No response
Additional context
No response