Description
Outline & Motivation
While optimizing the performance of some important DL workloads, we found that PyTorch Lightning can introduce redundant host & device synchronizations. These syncs hurt performance and block the use of CUDA graphs.
Progress Bar Metric
As shown in logger_connector/result.py#L491-L493:
# populate progress_bar metrics. convert tensors to numbers
if result_metric.meta.prog_bar:
    metrics["pbar"][forked_name] = convert_tensors_to_scalars(value)
If a metric is logged with prog_bar=True, e.g. pl_module.log('lr', lr, prog_bar=True), the metric tensor is always converted to a scalar, regardless of whether the user enables the progress bar, and this conversion introduces a host & device sync.
It's better to skip the conversion when the user doesn't want a progress bar, i.e. when trainer.enable_progress_bar = False, so that these synchronizations are avoided.
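Below is a minimal, self-contained sketch of the idea. It is not Lightning's actual connector code: collect_pbar_metrics is a hypothetical helper standing in for the logic in logger_connector/result.py, and the enable_progress_bar flag is assumed to be passed in from the trainer.

import torch

def collect_pbar_metrics(logged: dict, enable_progress_bar: bool) -> dict:
    # When the progress bar is disabled, skip the tensor-to-scalar conversion
    # entirely, so no device-to-host copy (and no GPU sync) is triggered.
    if not enable_progress_bar:
        return {}
    # .item() copies each value to the host, which synchronizes with the GPU.
    return {
        name: value.item() if isinstance(value, torch.Tensor) else value
        for name, value in logged.items()
    }

device = "cuda" if torch.cuda.is_available() else "cpu"
logged = {"lr": torch.tensor(1e-3, device=device)}
print(collect_pbar_metrics(logged, enable_progress_bar=False))  # {} -> no sync
print(collect_pbar_metrics(logged, enable_progress_bar=True))   # converts, syncs on CUDA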
Best Metric Device
The metric tensor is always placed on the GPU when training with a CUDA device, so every retrieval of the metric triggers a device-to-host synchronization.
Instead, I'd propose putting the metric on the same device as the value/tensor that updates it. For example:
- The user is likely to update the global_step metric with a scalar on the CPU, so the metric for global_step should live on the CPU;
- The user is likely to update the loss metric with a tensor on the GPU, so the metric for loss should live on the GPU.
In this way, updating the global_step metric no longer introduces a host & device synchronization, since the metric lives on the CPU instead of the GPU. If the user later retrieves the global_step metric, there is no sync either.
The original logic is at core/module.py#L657-L661:
value = (
    value.clone().detach()
    if isinstance(value, Tensor)
    else torch.tensor(value, device=self.device, dtype=_get_default_dtype())
)
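Here is a sketch of the proposed change as a hypothetical standalone helper (to_metric_tensor is not Lightning's API, and torch.get_default_dtype() stands in for _get_default_dtype()). The only behavioural difference from the snippet above is that plain numbers are no longer moved onto self.device.

import torch
from torch import Tensor

def to_metric_tensor(value):
    # Tensors keep the device they were logged on: a GPU loss stays on the
    # GPU, a CPU tensor stays on the CPU.
    if isinstance(value, Tensor):
        return value.clone().detach()
    # Plain Python numbers (e.g. global_step) become CPU tensors, so later
    # updates and reads never cross the host/device boundary.
    return torch.tensor(value, dtype=torch.get_default_dtype())

step_metric = to_metric_tensor(123)  # lives on the CPU
loss_device = "cuda" if torch.cuda.is_available() else "cpu"
loss_metric = to_metric_tensor(torch.ones((), device=loss_device))
print(step_metric.device, loss_metric.device)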
Pitch
No response
Additional context
No response