Conversation

khlaifiabilel

Summary

This PR adds a comprehensive troubleshooting guide to help users resolve common issues encountered when using TorchTitan.

Motivation

  • Users frequently encounter similar issues (OOM errors, setup problems, configuration errors)
  • Reduces repetitive questions on PyTorch forums and GitHub issues
  • Provides a central reference for debugging and problem-solving
  • Makes it easier for new users to get started

What's Included

Setup & Installation

  • GPU detection issues and PyTorch CUDA configuration
  • Tokenizer download failures and HuggingFace authentication
  • Import errors and PYTHONPATH configuration
  • Pre-commit hook installation and troubleshooting

Training Issues

  • Out of Memory (OOM) errors with multiple solutions (batch size, activation checkpointing, gradient accumulation, CPU offload)
  • Training hangs and debugging with environment variables
  • Checkpoint loading failures and compatibility issues
  • Loss divergence (NaN/Inf) with solutions for learning rate, gradient clipping, and deterministic debugging

Configuration Issues

  • Boolean flag overrides from CLI (using --no prefix)
  • Config validation errors and debugging
  • Parallelism dimension conflicts with clear examples

Distributed Training

  • Multi-node setup issues (network configuration, master address/port)
  • NCCL communication errors and timeout settings
  • Network interface configuration

Performance Issues

  • Low GPU utilization diagnosis and solutions
  • Slow training speed optimization tips
  • Profiling commands and benchmarking guidance

Getting Help

  • Where to ask questions (Forums, GitHub Issues, Discussions)
  • How to report issues effectively with required information
  • Useful debugging commands for quick diagnosis
  • Links to related documentation

Testing

  • All commands have been verified to work
  • Pre-commit hooks pass (trailing whitespace fixed)
  • Cross-references to other docs are correct (debugging.md, fsdp.md, checkpoint.md)
  • Markdown formatting is valid

Type of Change

  • Documentation update

Additional Context

This guide complements existing documentation by providing quick, actionable solutions to common problems. It's designed to be easily searchable and scannable for users who are stuck. The guide focuses on practical solutions with copy-pasteable commands and clear examples.

Future enhancements could include:

  • Hardware-specific troubleshooting (AMD GPUs, different CUDA versions)
  • Visual flowcharts for issue diagnosis
  • Cloud provider-specific setup guides
  • Video tutorials for complex multi-node setups

Checklist

  • Documentation follows the style of this project
  • Pre-commit hooks pass
  • All links are valid and working
  • Cross-references to other docs are correct
  • I have read the CONTRIBUTING.md document

- Add troubleshooting guide covering common setup and training issues
- Include solutions for OOM errors, GPU detection, and tokenizer downloads
- Cover distributed training issues (multi-node setup, NCCL errors)
- Add configuration troubleshooting (boolean flags, parallelism settings)
- Include performance debugging tips and monitoring commands
- Provide clear examples and commands for each solution
- Add links to related documentation and resources

This guide will help users quickly resolve common problems and reduce
repetitive questions on forums and GitHub issues.

meta-cla bot commented Oct 6, 2025

Hi @khlaifiabilel!

Thank you for your pull request and welcome to our community.

Action Required

In order to merge any pull request (code, docs, etc.), we require contributors to sign our Contributor License Agreement, and we don't seem to have one on file for you.

Process

In order for us to review and merge your suggested changes, please sign at https://code.facebook.com/cla. If you are contributing on behalf of someone else (eg your employer), the individual CLA may not be sufficient and your employer may need to sign the corporate CLA.

Once the CLA is signed, our tooling will perform checks and validations. Afterwards, the pull request will be tagged with CLA signed. The tagging process may take up to 1 hour after signing. Please give it that time before contacting us about it.

If you have received this in error or have any questions, please contact us at [email protected]. Thanks!


meta-cla bot commented Oct 6, 2025

Thank you for signing our Contributor License Agreement. We can now accept your code for this (and any) Meta Open Source project. Thanks!

meta-cla bot added the CLA Signed label on Oct 6, 2025
Comment on lines +139 to +142
```toml
[model]
fsdp_cpu_offload = true
```
Contributor

There are other approaches to try first, like TP or effectively reducing the global batch size. CPU offload can be very slow and should be the last resort?
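As a rough sketch, those alternatives could look something like the CLI overrides below (the flag names are my assumption from the current job config and may differ across versions; values are illustrative):

```bash
# Things to try before CPU offload (verify flag names against your torchtitan version).
./run_train.sh --training.local_batch_size 1            # shrink the per-GPU batch size
./run_train.sh --activation_checkpoint.mode full        # full activation checkpointing
./run_train.sh --parallelism.tensor_parallel_degree 2   # shard the model further with TP
```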

Contributor

Agree; resolving an OOM issue is a very case-by-case decision, and a general guide here might not be that helpful.

4. **Start with debug model**:
```bash
CONFIG_FILE="./torchtitan/models/llama3/train_configs/debug_model.toml" ./run_train.sh
```
Contributor

Maybe add py-spy to understand which rank is doing what.
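For example, a minimal sketch using py-spy (an external tool; `<PID>` is a placeholder for one of the per-rank trainer process ids):

```bash
# Dump the Python stack of each stuck rank to see which collective or data-loading call it is in.
pip install py-spy
pgrep -f train.py          # list the per-rank trainer PIDs (or: pgrep -f torchrun)
py-spy dump --pid <PID>    # print the current Python stack for that rank
```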


### Training Hangs at Start

**Problem:** Training process hangs without error messages
Contributor

Sometimes it is because we only output the rank 0 log, so errors on other ranks are swallowed by torchrun. LOG_RANK=0,1,2,3,4,5,6,7 can help; I use this a lot.
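For example (assuming run_train.sh still forwards LOG_RANK to torchrun's local-ranks filter):

```bash
# Surface logs from all 8 local ranks instead of only rank 0, so per-rank errors are not hidden.
LOG_RANK=0,1,2,3,4,5,6,7 CONFIG_FILE="./torchtitan/models/llama3/train_configs/debug_model.toml" ./run_train.sh
```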


2. **Check checkpoint compatibility**:
- Ensure same model architecture
- Verify parallelism settings match
Contributor

Parallelism settings are not required to match.


1. **Set NCCL timeout**:
```bash
export NCCL_TIMEOUT=1800
Contributor

TorchTitan has a comm timeout setting. We should just use that?
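Rough sketch, assuming the `[comm]` section still exposes these knobs (names and defaults may have changed; values are illustrative):

```bash
# Prefer torchtitan's own communication timeouts over NCCL_TIMEOUT.
./run_train.sh \
  --comm.init_timeout_seconds 600 \
  --comm.train_timeout_seconds 1800
```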

compile = true
```

3. **Check data loading** isn't the bottleneck:
Contributor

Have you ever encountered this case when using TorchTitan, or heard of such a case?

```

2. **Check TFLOPs and MFU** in logs:
- Compare with benchmark numbers in `benchmarks/`
Contributor

And check the trace to understand if there are exposed communications.
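For example, a sketch of collecting a profiler trace to look for exposed (non-overlapped) communication, assuming the `[profiling]` options keep their current names:

```bash
# Capture a torch.profiler trace periodically and inspect it in Perfetto / chrome://tracing.
./run_train.sh \
  --profiling.enable_profiling \
  --profiling.profile_freq 10 \
  --profiling.save_traces_folder profile_trace
```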

python -m torchtitan.config.manager --job.config_file config.toml

# Run with maximum debugging
export TORCH_DISTRIBUTED_DEBUG=DETAIL
Contributor

I'm not sure we want to advocate this, given that we now have the NCCL flight recorder. It is usually easier to just check the flight recorder result.

Contributor

The most common case where people turn on this debug information is an NCCL timeout. We should just focus on the NCCL flight recorder; the other debug logs are likely too expert-only.
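A minimal sketch of enabling the flight recorder via PyTorch's environment variables (worth double-checking the exact variable names against the PyTorch docs for your version):

```bash
# Keep a ring buffer of recent collectives and dump it per rank on NCCL watchdog timeout.
export TORCH_NCCL_TRACE_BUFFER_SIZE=20000
export TORCH_NCCL_DUMP_ON_TIMEOUT=1
export TORCH_NCCL_DEBUG_INFO_TEMP_FILE=/tmp/nccl_trace_rank_   # per-rank dump file prefix
./run_train.sh
```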

Contributor

@wwwjn left a comment

Thanks for the contribution, I think it's a good guide for new torchtitan users! However, I think it would be better as a blog / tutorial rather than part of torchtitan/docs, for the following reasons:

  1. In the torchtitan docs section, the instructions are divided into different topics, and this guide has a lot of overlap with other docs.
  2. It contains basic GPU usage and model training knowledge that is not specifically related to torchtitan.
  3. Debugging is often highly case-by-case; providing a general starting point is good but might not be helpful in the end.

I'm open to discuss where we should put our docs and how to organize them.

nvidia-smi
```

3. **Verify network connectivity** (multi-node):
Contributor

This part is not even related to torchtitan

--output_path ./checkpoint
```

### Loss is NaN or Diverging
Contributor

Also, this issue might be very different in different cases: a simpler case might be that the learning rate is too high, but sometimes it is even because of a PyTorch or torchtitan bug.
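For the simpler case, a rough sketch of the first knobs to try (flag names are assumptions and values are illustrative; a real PyTorch/torchtitan bug obviously needs deeper debugging):

```bash
# Lower the learning rate and enable gradient clipping before digging further.
./run_train.sh \
  --optimizer.lr 1e-4 \
  --training.max_norm 1.0
```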

Contributor

Maybe also good to add how to enable anomaly detection.

Comment on lines +313 to +319
2. **Verify all nodes can communicate**:
```bash
# From each node, ping master
ping $MASTER_ADDR
```

3. **Check firewall settings** - ensure port is open
Contributor

This part does not seem to be related to torchtitan

--profiling.no-enable-memory-snapshot
```

See [debugging.md](debugging.md#overriding-boolean-flags-from-toml-via-cli) for more details.
Contributor

As you mentioned, this part overlaps with debugging.md.

Contributor

@tianyu-l left a comment

This is a nice doc to have.

But I have the worry that this document explicitly mentions a lot of configs, which would be easy to break, or significantly slow down the development / refactor.

I don't want the torchtitan core team to maintain this doc. Maybe it should be part of the broader releasing engineering's job.

cc @dcci @svekars

Member

dcci commented Oct 6, 2025

> This is a nice doc to have.
>
> But I have the worry that this document explicitly mentions a lot of configs, which would be easy to break, or significantly slow down the development / refactor.
>
> I don't want the torchtitan core team to maintain this doc. Maybe it should be part of the broader releasing engineering's job.
>
> cc @dcci @svekars

Agree. The maintenance cost of this is relatively high.
