bowenyang008
Contributor

@bowenyang008 bowenyang008 commented May 19, 2025

What does this PR do?

Adds logging of FSDP2 execution time, the wrapped model, and the verbose config.

Test

Tested manually, see: mpt-7b-fsdp2-G2eJiP

@bowenyang008 bowenyang008 changed the title from "Boweny/fsdp2/meta init time report" to "FSDP2 time and verbose logging" May 19, 2025
@bowenyang008 bowenyang008 marked this pull request as ready for review May 19, 2025 18:14
@bowenyang008
Contributor Author

@dakinggg when do we use Composer's custom loggers, and when do we use the generic Python logging module?

Contributor

@dakinggg dakinggg left a comment

LGTM. Please leave the MCLI run name in the "manual testing" section of the PR description so it is easy for the reviewer to double-check and refer back to, unless the run name includes sensitive info, in which case you can DM it to the reviewer.

@dakinggg
Contributor

@bowenyang008 Composer loggers need to be used if you want to log to an experiment tracker (MLflow, wandb, etc.). We also have a custom logger for the MCLI platform, and custom loggers for progress bar logging and "pretty" training progress console logging. But for general debugging/info statements, just use Python logging.
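
For illustration, here is a minimal sketch of the plain Python logging path for debugging/info statements, using only the standard library. The logger name and the `describe_config` and `capture_logs` helpers are hypothetical; they are not taken from Composer or from this PR, and the experiment-tracker path (which goes through Composer's own loggers) is not shown.

```python
import io
import logging

# Module-level logger. The name here is a hypothetical stand-in; in a real
# module this would usually be logging.getLogger(__name__).
log = logging.getLogger("example.distributed.fsdp2")


def describe_config(config: dict) -> None:
    """Log each config entry at INFO level, sorted by key."""
    for key in sorted(config):
        log.info("FSDP2: %s: %s", key, config[key])


def capture_logs(config: dict) -> str:
    """Run describe_config with a temporary handler and return the log text."""
    stream = io.StringIO()
    handler = logging.StreamHandler(stream)
    handler.setFormatter(
        logging.Formatter("%(levelname)s: %(name)s: %(message)s"),
    )
    log.addHandler(handler)
    log.setLevel(logging.INFO)
    try:
        describe_config(config)
    finally:
        # Always detach the handler so repeated calls do not duplicate output.
        log.removeHandler(handler)
    return stream.getvalue()


if __name__ == "__main__":
    print(
        capture_logs({
            "reshard_after_forward": True,
            "activation_checkpointing": True,
        }),
        end="",
    )
```

Statements logged this way go to whatever handlers the application configures (typically the console), not to an experiment tracker.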

@bowenyang008
Contributor Author

bowenyang008 commented May 19, 2025

> LGTM. Please leave the MCLI run name in the "manual testing" section of the PR description so it is easy for the reviewer to double-check and refer back to, unless the run name includes sensitive info, in which case you can DM it to the reviewer.

Oh, I just tested locally on my dev machine, since the PR is simple enough that I only needed to make sure it logs the info to the console correctly. But here is an MCLI run I just kicked off: mpt-7b-fsdp2-G2eJiP

2025-05-19 18:36:19,251: rank0[9633][MainThread]: INFO: composer.distributed.fsdp2: FSDP2: Fully sharded model: ComposerMPTCausalLM(
  (model): FSDPMPTForCausalLM(
    (transformer): MPTModel(
      (wte): FSDPSharedEmbedding(50368, 4096)
      (wpe): FSDPEmbedding(2048, 4096)
      (emb_drop): FSDPDropout(p=0.0, inplace=False)
      (blocks): ModuleList(
        (0-31): 32 x CheckpointWrapper(
          (_checkpoint_wrapped_module): FSDPMPTBlock(
            (norm_1): LPLayerNorm((4096,), eps=1e-05, elementwise_affine=True)
            (attn): MultiheadAttention(
              (Wqkv): Linear(in_features=4096, out_features=12288, bias=True)
              (out_proj): Linear(in_features=4096, out_features=4096, bias=True)
            )
            (norm_2): LPLayerNorm((4096,), eps=1e-05, elementwise_affine=True)
            (ffn): MPTMLP(
              (up_proj): Linear(in_features=4096, out_features=16384, bias=True)
              (down_proj): Linear(in_features=16384, out_features=4096, bias=True)
            )
            (resid_attn_dropout): Dropout(p=0.0, inplace=False)
            (resid_ffn_dropout): Dropout(p=0.0, inplace=False)
          )
        )
      )
      (norm_f): FSDPLPLayerNorm((4096,), eps=1e-05, elementwise_affine=True)
    )
  )
  (loss_fn): FSDPCrossEntropyLoss()
)
2025-05-19 18:36:19,251: rank0[9633][MainThread]: INFO: composer.distributed.fsdp2: FSDP2: activation_checkpointing: True
2025-05-19 18:36:19,252: rank0[9633][MainThread]: INFO: composer.distributed.fsdp2: FSDP2: activation_cpu_offload: False
2025-05-19 18:36:19,252: rank0[9633][MainThread]: INFO: composer.distributed.fsdp2: FSDP2: device_mesh: DeviceMesh('cuda', [0, 1, 2, 3, 4, 5, 6, 7], mesh_dim_names=('data_parallel_shard',))
2025-05-19 18:36:19,252: rank0[9633][MainThread]: INFO: composer.distributed.fsdp2: FSDP2: reshard_after_forward: True
2025-05-19 18:36:19,252: rank0[9633][MainThread]: INFO: composer.distributed.prepare_distributed: Prepare FSDP2 took 0.12 seconds
2025-05-19 18:36:20,927: rank0[9633][MainThread]: INFO: composer.distributed.prepare_distributed: Meta Init Device took 1.68 seconds
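
The "took N seconds" lines above could be produced by a pattern like the following. This is only a sketch under stated assumptions: `log_elapsed` and the logger name are hypothetical and are not the helper this PR actually adds; the sketch uses only the standard library.

```python
import logging
import time
from contextlib import contextmanager
from typing import Iterator

# Hypothetical logger name, chosen only for this example.
log = logging.getLogger("example.distributed.prepare_distributed")


@contextmanager
def log_elapsed(logger: logging.Logger, operation_name: str) -> Iterator[None]:
    """Log how long the wrapped block took, even if it raises."""
    start = time.perf_counter()
    try:
        yield
    finally:
        elapsed = time.perf_counter() - start
        logger.info("%s took %.2f seconds", operation_name, elapsed)


if __name__ == "__main__":
    logging.basicConfig(
        level=logging.INFO,
        format="%(asctime)s: %(levelname)s: %(name)s: %(message)s",
    )
    # Nested use: the inner block exits first, so its line is logged first.
    with log_elapsed(log, "Outer step"):
        with log_elapsed(log, "Inner step"):
            time.sleep(0.1)
```

With nested blocks, the inner timing is logged before the outer one, which would be consistent with the order of the last two log lines above.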

@dakinggg
Contributor

Ah, fair enough. Yeah, it's nice to launch a run for posterity (although this is obviously a very minor change).

@bowenyang008 bowenyang008 enabled auto-merge (squash) May 19, 2025 18:49
@bowenyang008 bowenyang008 merged commit 41b99f8 into main May 19, 2025
14 checks passed
@bowenyang008 bowenyang008 deleted the boweny/fsdp2/meta-init-time-report branch May 19, 2025 18:57