Docs: Add comprehensive troubleshooting guide #1801
Conversation
- Add troubleshooting guide covering common setup and training issues
- Include solutions for OOM errors, GPU detection, and tokenizer downloads
- Cover distributed training issues (multi-node setup, NCCL errors)
- Add configuration troubleshooting (boolean flags, parallelism settings)
- Include performance debugging tips and monitoring commands
- Provide clear examples and commands for each solution
- Add links to related documentation and resources

This guide will help users quickly resolve common problems and reduce repetitive questions on forums and GitHub issues.
Hi @khlaifiabilel! Thank you for your pull request and welcome to our community.

**Action Required:** In order to merge any pull request (code, docs, etc.), we require contributors to sign our Contributor License Agreement, and we don't seem to have one on file for you.

**Process:** In order for us to review and merge your suggested changes, please sign at https://code.facebook.com/cla. If you are contributing on behalf of someone else (e.g. your employer), the individual CLA may not be sufficient and your employer may need to sign the corporate CLA. Once the CLA is signed, our tooling will perform checks and validations, and the pull request will be tagged. If you have received this in error or have any questions, please contact us at [email protected]. Thanks!
Thank you for signing our Contributor License Agreement. We can now accept your code for this (and any) Meta Open Source project. Thanks!
```toml
[model]
fsdp_cpu_offload = true
```
There are other approaches to try first, like TP and effectively reducing the global batch size. CPU offload can be very slow and should be the last resort?
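For example, the alternatives could look something like the following CLI overrides (the option names here are assumptions and may differ between torchtitan versions; check the current job config for the exact spelling):

```bash
# Sketch only: try parallelism / batch-size knobs before falling back to CPU offload.
./run_train.sh \
    --parallelism.tensor_parallel_degree 2 \
    --training.local_batch_size 1
```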
Agree. Resolving an OOM issue is a very case-by-case decision, and a general guide here might not be that helpful.
4. **Start with debug model**:
   ```bash
   CONFIG_FILE="./torchtitan/models/llama3/train_configs/debug_model.toml" ./run_train.sh
   ```
Maybe add py-spy to understand which rank is doing what.
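A rough sketch of what that could look like (assumes py-spy is installed and that the trainer processes can be matched by name; `train.py` below is just a guess, adjust the pattern to your launch command):

```bash
# Dump the Python stack of every local rank to see where each one is stuck.
pip install py-spy
for pid in $(pgrep -f "train.py"); do
    echo "=== process $pid ==="
    py-spy dump --pid "$pid"
done
```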
### Training Hangs at Start

**Problem:** Training process hangs without error messages
Sometimes it is because we only output the rank0 log, and errors on other ranks get swallowed by TorchRun. `LOG_RANK=0,1,2,3,4,5,6,7` can help. I used this a lot.
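For example (assuming the launcher script forwards `LOG_RANK` to torchrun's local-rank log filter, as the stock `run_train.sh` appears to do):

```bash
# Show logs from all 8 local ranks instead of just rank 0,
# so errors on non-zero ranks are not swallowed.
LOG_RANK=0,1,2,3,4,5,6,7 ./run_train.sh
```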
2. **Check checkpoint compatibility**:
   - Ensure same model architecture
   - Verify parallelism settings match
The parallelism settings are not required to match.
1. **Set NCCL timeout**:
   ```bash
   export NCCL_TIMEOUT=1800
   ```
TorchTitan has a comm timeout setting. We should just use that?
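If so, the override might look something like this (the `[comm]` section / option name is an assumption and may differ by version):

```bash
# Raise the collective timeout through torchtitan's own config instead of NCCL_TIMEOUT.
./run_train.sh --comm.train_timeout_seconds 1800
```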
```toml
compile = true
```

3. **Check data loading** isn't the bottleneck:
Have you ever encountered this case when using TorchTitan, or heard of such a case?
2. **Check TFLOPs and MFU** in logs:
   - Compare with benchmark numbers in `benchmarks/`
And check the trace to understand if there are exposed communications.
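One possible way to grab such a trace, assuming the `[profiling]` options referenced elsewhere in this guide (flag names may differ):

```bash
# Collect profiler traces periodically, then open them (e.g. in Perfetto or
# chrome://tracing) and look for communication kernels not overlapped with compute.
./run_train.sh --profiling.enable_profiling --profiling.profile_freq 10
```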
```bash
python -m torchtitan.config.manager --job.config_file config.toml
```

```bash
# Run with maximum debugging
export TORCH_DISTRIBUTED_DEBUG=DETAIL
```
I'm not sure if we want to advocate this. Given that we now have the NCCL flight recorder, it is usually easier to just check the flight recorder result.
The most common case where people turn on this debug information is an NCCL timeout. We should just focus on the NCCL flight recorder. Other debug logs are likely too expert-only.
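For reference, a sketch of turning the flight recorder on via PyTorch's environment variables (variable names and defaults may vary across PyTorch versions):

```bash
# Record recent NCCL collectives and dump them when the watchdog detects a timeout.
export TORCH_NCCL_TRACE_BUFFER_SIZE=20000
export TORCH_NCCL_DUMP_ON_TIMEOUT=1
export TORCH_NCCL_DEBUG_INFO_TEMP_FILE=/tmp/nccl_trace_rank_
./run_train.sh
```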
Thanks for the contribution, I think it's a good guide for new torchtitan users! However, I think it's better made into a blog / tutorial rather than put into the torchtitan/docs part, for the following reasons:
- In the torchtitan docs section, the instructions are divided into different topics, and this guide has a lot of overlap with other docs.
- There is basic GPU usage knowledge and model training knowledge here that is not specifically related to torchtitan.
- Sometimes debugging is highly case-by-case; providing a general starting point is good but might not be helpful in the end.

I'm open to discussing where we should put our docs and how to organize them.
```bash
nvidia-smi
```

3. **Verify network connectivity** (multi-node):
This part is not even related to torchtitan
```bash
--output_path ./checkpoint
```

### Loss is NaN or Diverging
Also, this issue might be very different in different cases: the simpler case might be that the learning rate is too high, but sometimes it is even because of a PyTorch or torchtitan bug.
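For the simpler case, a quick first experiment could be a CLI override like this (the option name is assumed from the `[optimizer]` config section):

```bash
# Rule out an overly aggressive learning rate before digging for framework bugs.
./run_train.sh --optimizer.lr 1e-4
```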
Maybe also good to add how to enable anomaly detection.
2. **Verify all nodes can communicate**:
   ```bash
   # From each node, ping master
   ping $MASTER_ADDR
   ```

3. **Check firewall settings** - ensure port is open
This part does not seem to be related to torchtitan
```bash
--profiling.no-enable-memory-snapshot
```

See [debugging.md](debugging.md#overriding-boolean-flags-from-toml-via-cli) for more details.
As you mentioned, this part overlaps with debugging.md.
This is a nice doc to have.
But I worry that this document explicitly mentions a lot of configs, which would be easy to break or could significantly slow down development / refactoring.
I don't want the torchtitan core team to maintain this doc. Maybe it should be part of the broader release engineering's job.
Agree. The maintenance cost of this is relatively high.
### Summary

This PR adds a comprehensive troubleshooting guide to help users resolve common issues encountered when using TorchTitan.

### Motivation

### What's Included

- Setup & Installation
- Training Issues
- Configuration Issues
- Distributed Training
- Performance Issues
- Getting Help

### Testing

### Type of Change

### Additional Context

This guide complements existing documentation by providing quick, actionable solutions to common problems. It's designed to be easily searchable and scannable for users who are stuck. The guide focuses on practical solutions with copy-pasteable commands and clear examples.

Future enhancements could include:

### Checklist