Conversation

khlaifiabilel

Summary

This PR adds a comprehensive troubleshooting guide to help users resolve common issues encountered when using TorchTitan.

Motivation

  • Users frequently encounter similar issues (OOM errors, setup problems, configuration errors)
  • Reduces repetitive questions on PyTorch forums and GitHub issues
  • Provides a central reference for debugging and problem-solving
  • Makes it easier for new users to get started

What's Included

Setup & Installation

  • GPU detection issues and PyTorch CUDA configuration
  • Tokenizer download failures and HuggingFace authentication
  • Import errors and PYTHONPATH configuration
  • Pre-commit hook installation and troubleshooting

Training Issues

  • Out of Memory (OOM) errors with multiple solutions (batch size, activation checkpointing, gradient accumulation, CPU offload)
  • Training hangs and debugging with environment variables
  • Checkpoint loading failures and compatibility issues
  • Loss divergence (NaN/Inf) with solutions for learning rate, gradient clipping, and deterministic debugging

Configuration Issues

  • Boolean flag overrides from CLI (using --no prefix)
  • Config validation errors and debugging
  • Parallelism dimension conflicts with clear examples

Distributed Training

  • Multi-node setup issues (network configuration, master address/port)
  • NCCL communication errors and timeout settings
  • Network interface configuration

Performance Issues

  • Low GPU utilization diagnosis and solutions
  • Slow training speed optimization tips
  • Profiling commands and benchmarking guidance

Getting Help

  • Where to ask questions (Forums, GitHub Issues, Discussions)
  • How to report issues effectively with required information
  • Useful debugging commands for quick diagnosis
  • Links to related documentation

Testing

  • All commands have been verified to work
  • Pre-commit hooks pass (trailing whitespace fixed)
  • Cross-references to other docs are correct (debugging.md, fsdp.md, checkpoint.md)
  • Markdown formatting is valid

Type of Change

  • Documentation update

Additional Context

This guide complements existing documentation by providing quick, actionable solutions to common problems. It's designed to be easily searchable and scannable for users who are stuck. The guide focuses on practical solutions with copy-pasteable commands and clear examples.

Future enhancements could include:

  • Hardware-specific troubleshooting (AMD GPUs, different CUDA versions)
  • Visual flowcharts for issue diagnosis
  • Cloud provider-specific setup guides
  • Video tutorials for complex multi-node setups

Checklist

  • Documentation follows the style of this project
  • Pre-commit hooks pass
  • All links are valid and working
  • Cross-references to other docs are correct
  • I have read the CONTRIBUTING.md document

- Add troubleshooting guide covering common setup and training issues
- Include solutions for OOM errors, GPU detection, and tokenizer downloads
- Cover distributed training issues (multi-node setup, NCCL errors)
- Add configuration troubleshooting (boolean flags, parallelism settings)
- Include performance debugging tips and monitoring commands
- Provide clear examples and commands for each solution
- Add links to related documentation and resources

This guide will help users quickly resolve common problems and reduce
repetitive questions on forums and GitHub issues.

meta-cla bot commented Oct 6, 2025

Hi @khlaifiabilel!

Thank you for your pull request and welcome to our community.

Action Required

In order to merge any pull request (code, docs, etc.), we require contributors to sign our Contributor License Agreement, and we don't seem to have one on file for you.

Process

In order for us to review and merge your suggested changes, please sign at https://code.facebook.com/cla. If you are contributing on behalf of someone else (eg your employer), the individual CLA may not be sufficient and your employer may need to sign the corporate CLA.

Once the CLA is signed, our tooling will perform checks and validations. Afterwards, the pull request will be tagged with CLA signed. The tagging process may take up to 1 hour after signing. Please give it that time before contacting us about it.

If you have received this in error or have any questions, please contact us at [email protected]. Thanks!


meta-cla bot commented Oct 6, 2025

Thank you for signing our Contributor License Agreement. We can now accept your code for this (and any) Meta Open Source project. Thanks!

meta-cla bot added the CLA Signed label on Oct 6, 2025
Comment on lines +139 to +142
```toml
[model]
fsdp_cpu_offload = true
```
Contributor

There are other approaches to try first, like TP or effectively reducing the global batch size. CPU offload can be very slow and should be the last resort?
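As a rough sketch, those alternatives could look something like the CLI overrides below (the flag names are my assumption from the current job config and may differ across versions; values are illustrative):

```bash
# Things to try before CPU offload (verify flag names against your torchtitan version).
./run_train.sh --training.local_batch_size 1            # shrink the per-GPU batch size
./run_train.sh --activation_checkpoint.mode full        # full activation checkpointing
./run_train.sh --parallelism.tensor_parallel_degree 2   # shard the model further with TP
```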

Contributor

Agree; resolving an OOM issue is a very case-by-case decision, and a general guide here might not be that helpful.

4. **Start with debug model**:
```bash
CONFIG_FILE="./torchtitan/models/llama3/train_configs/debug_model.toml" ./run_train.sh
```
Contributor

Maybe add py-spy to understand which rank is doing what.
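For example, a minimal sketch using py-spy (an external tool; `<PID>` is a placeholder for one of the per-rank trainer process ids):

```bash
# Dump the Python stack of each stuck rank to see which collective or data-loading call it is in.
pip install py-spy
pgrep -f train.py          # list the per-rank trainer PIDs (or: pgrep -f torchrun)
py-spy dump --pid <PID>    # print the current Python stack for that rank
```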


### Training Hangs at Start

**Problem:** Training process hangs without error messages
Contributor

Sometimes it is because we only output the rank 0 log, so errors on other ranks are swallowed by torchrun. LOG_RANK=0,1,2,3,4,5,6,7 can help; I use this a lot.
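For example (assuming run_train.sh still forwards LOG_RANK to torchrun's local-ranks filter):

```bash
# Surface logs from all 8 local ranks instead of only rank 0, so per-rank errors are not hidden.
LOG_RANK=0,1,2,3,4,5,6,7 CONFIG_FILE="./torchtitan/models/llama3/train_configs/debug_model.toml" ./run_train.sh
```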


2. **Check checkpoint compatibility**:
- Ensure same model architecture
- Verify parallelism settings match
Contributor

Parallelism settings are not required to match.


1. **Set NCCL timeout**:
```bash
export NCCL_TIMEOUT=1800
Contributor

TorchTitan has a comm timeout setting. We should just use that?
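Rough sketch, assuming the `[comm]` section still exposes these knobs (names and defaults may have changed; values are illustrative):

```bash
# Prefer torchtitan's own communication timeouts over NCCL_TIMEOUT.
./run_train.sh \
  --comm.init_timeout_seconds 600 \
  --comm.train_timeout_seconds 1800
```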

compile = true
```

3. **Check data loading** isn't the bottleneck:
Contributor

Have you ever encountered this case when using TorchTitan, or heard of such a case?

```

2. **Check TFLOPs and MFU** in logs:
- Compare with benchmark numbers in `benchmarks/`
Contributor

And check the trace to understand if there are exposed communications.
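For example, a sketch of collecting a profiler trace to look for exposed (non-overlapped) communication, assuming the `[profiling]` options keep their current names:

```bash
# Capture a torch.profiler trace periodically and inspect it in Perfetto / chrome://tracing.
./run_train.sh \
  --profiling.enable_profiling \
  --profiling.profile_freq 10 \
  --profiling.save_traces_folder profile_trace
```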

python -m torchtitan.config.manager --job.config_file config.toml

# Run with maximum debugging
export TORCH_DISTRIBUTED_DEBUG=DETAIL
Contributor

I'm not sure we want to advocate this, given that we now have the NCCL flight recorder. It is usually easier to just check the flight recorder result.

Contributor

The most common case where people turn on this debug information is an NCCL timeout. We should just focus on the NCCL flight recorder; the other debug logs are likely too expert-only.
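A minimal sketch of enabling the flight recorder via PyTorch's environment variables (worth double-checking the exact variable names against the PyTorch docs for your version):

```bash
# Keep a ring buffer of recent collectives and dump it per rank on NCCL watchdog timeout.
export TORCH_NCCL_TRACE_BUFFER_SIZE=20000
export TORCH_NCCL_DUMP_ON_TIMEOUT=1
export TORCH_NCCL_DEBUG_INFO_TEMP_FILE=/tmp/nccl_trace_rank_   # per-rank dump file prefix
./run_train.sh
```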

Contributor

@wwwjn left a comment

Thanks for the contribution, I think it's a good guide for new torchtitan users! However, I think it would be better as a blog / tutorial rather than part of torchtitan/docs, for the following reasons:

  1. In the torchtitan docs section, the instructions are divided into different topics, and this guide has a lot of overlap with other docs.
  2. It contains basic GPU usage and model training knowledge that is not specifically related to torchtitan.
  3. Debugging is often highly case-by-case; providing a general starting point is good but might not be helpful in the end.

I'm open to discuss where we should put our docs and how to organize them.

nvidia-smi
```

3. **Verify network connectivity** (multi-node):
Contributor

This part is not even related to torchtitan

--output_path ./checkpoint
```

### Loss is NaN or Diverging
Contributor

Also, this issue might be very different in different cases: a simpler case might be that the learning rate is too high, but sometimes it is even because of a PyTorch or torchtitan bug.
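For the simpler case, a rough sketch of the first knobs to try (flag names are assumptions and values are illustrative; a real PyTorch/torchtitan bug obviously needs deeper debugging):

```bash
# Lower the learning rate and enable gradient clipping before digging further.
./run_train.sh \
  --optimizer.lr 1e-4 \
  --training.max_norm 1.0
```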

Contributor

Maybe also good to add how to enable anomaly detection.

Comment on lines +313 to +319
2. **Verify all nodes can communicate**:
```bash
# From each node, ping master
ping $MASTER_ADDR
```

3. **Check firewall settings** - ensure port is open
Contributor

This part does not seem to be related to torchtitan

--profiling.no-enable-memory-snapshot
```

See [debugging.md](debugging.md#overriding-boolean-flags-from-toml-via-cli) for more details.
Contributor

As you mentioned, this part overlaps with debugging.md.

Contributor

@tianyu-l left a comment

This is a nice doc to have.

But I have the worry that this document explicitly mentions a lot of configs, which would be easy to break, or significantly slow down the development / refactor.

I don't want the torchtitan core team to maintain this doc. Maybe it should be part of the broader releasing engineering's job.

cc @dcci @svekars

Member

dcci commented Oct 6, 2025

> This is a nice doc to have.
>
> But I have the worry that this document explicitly mentions a lot of configs, which would be easy to break, or significantly slow down the development / refactor.
>
> I don't want the torchtitan core team to maintain this doc. Maybe it should be part of the broader releasing engineering's job.
>
> cc @dcci @svekars

Agree. The maintenance cost of this is relatively high.
