examples/pytorch/transformer/README.md (6 additions & 69 deletions)
@@ -2,18 +2,11 @@

 This directory contains a comprehensive testing framework for validating **Context Parallelism (CP)** in TransformerEngine. The framework compares single-GPU baseline runs against distributed multi-GPU runs to ensure numerical consistency and correctness.

-**Key Features:**
-- 🔄 Automatic synchronization using file-based completion markers
-- 📊 Configurable logging verbosity for cleaner output
-- 🎯 Dynamic sequence dimension handling for BSHD/SBHD formats
-- ⚡ Optimized device placement for reduced CPU-GPU overhead
-- 🛡️ Comprehensive shape validation and error handling
-
 ## 🏗️ Architecture Overview

 ### Test Models

-The framework supports two attention formats, each with its own model implementation in `model.py`:
+The framework supports two attention input formats, each with its own model implementation in `model.py`:

 **1. SimpleThDModel (THD Format - Token-Head-Dimension):**
 - Processes sequences in flattened token format
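For orientation, the layout names used above follow the usual TransformerEngine conventions; below is a minimal sketch of the tensor shapes involved. The sizes and the `cu_seqlens` bookkeeping are illustrative assumptions, not taken from `model.py`.

```python
import torch

batch, seq_len, heads, head_dim = 2, 128, 8, 64

# BSHD: (batch, sequence, heads, head_dim) -- batch-major padded layout
q_bshd = torch.randn(batch, seq_len, heads, head_dim)

# SBHD: (sequence, batch, heads, head_dim) -- sequence-major padded layout
q_sbhd = q_bshd.transpose(0, 1).contiguous()

# THD: (total_tokens, heads, head_dim) -- variable-length sequences flattened
# into one token axis; cumulative sequence lengths mark sample boundaries.
seq_lens = [100, 72]                                          # per-sample valid lengths
cu_seqlens = torch.tensor([0, 100, 172], dtype=torch.int32)   # cumulative boundaries
q_thd = torch.randn(sum(seq_lens), heads, head_dim)

print(q_bshd.shape, q_sbhd.shape, q_thd.shape, cu_seqlens.tolist())
```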
@@ -76,7 +69,7 @@ The framework runs parallel test suites for both THD and BSHD formats:
 - `context_parallel_runner_thd.py` (THD format)
 - `context_parallel_runner_bshd.py` (BSHD format)

-Both runners execute on a single GPU:
+Both programs are launched with `torchrun`; in this first phase no parallelism is used, so the two GPUs each run an identical forward pass.
 1. **Model Creation**: Initialize model (SimpleThDModel or SimpleBSHDModel)
 2. **Forward Pass**: Process full sequences without parallelization
 3. **Loss Computation**: Cross-entropy on valid (non-padded) tokens
@@ -87,7 +80,7 @@ Both runners execute on a single GPU:
 5. **State Persistence**: Save model weights and results to `/tmp/` for CP=2 comparison

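The Phase 1 steps above amount to a plain single-process forward pass, a masked loss, and a checkpoint to disk. A hedged sketch follows; the function name, masking convention, and file name are illustrative, not the exact code in the runners.

```python
import torch
import torch.nn.functional as F

def run_baseline(model, input_ids, labels, pad_id=0, save_path="/tmp/cp_baseline.pt"):
    # Forward pass over the full, unsplit sequence (no context parallelism).
    logits = model(input_ids)                          # (batch, seq, vocab)

    # Cross-entropy only on valid (non-padded) positions.
    valid = labels != pad_id
    loss = F.cross_entropy(logits[valid], labels[valid])

    # Persist weights and reference outputs for the CP=2 comparison run.
    torch.save({"state_dict": model.state_dict(),
                "logits": logits.detach().cpu(),
                "loss": loss.item()}, save_path)
    return logits, loss
```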
 ### Phase 2: Distributed Run (CP=2)
-**Execution**: Same runner files in distributed mode via `torchrun`
+Then `torch.distributed` is initialized with context parallelism (CP=2), and both GPUs participate in a single forward pass.

 1. **Process Group Setup**: Initialize NCCL backend for 2 GPUs
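For readers unfamiliar with the distributed setup, here is a minimal sketch of what the process-group step amounts to under `torchrun`. The exact group wiring in the runners, and how the group is handed to the TransformerEngine layers, may differ; the function name is an assumption.

```python
import os
import torch
import torch.distributed as dist

def init_context_parallel(cp_size: int = 2):
    # torchrun exports RANK, WORLD_SIZE, and LOCAL_RANK for every process.
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    # Both GPUs form a single context-parallel group of size 2; each rank
    # then holds one shard of the sequence dimension during attention.
    cp_group = dist.new_group(ranks=list(range(cp_size)))
    return cp_group, dist.get_rank(), dist.get_world_size()
```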
 **Success Criteria**: ≥85% of logit elements must be within 2e-2 absolute difference

-**Why 85%?** Distributed computation with mixed precision (bfloat16) introduces expected numerical differences. This threshold balances strictness with practical distributed computing realities.
+**Why 85%?** Distributed computation with mixed precision (bfloat16) introduces expected numerical differences. This threshold balances strictness with practical distributed computing realities. Moreover, we observe that as the number of hidden layers (TE layers) increases, the numerical differences between the CP and non-CP runs also grow.

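The criterion above reduces to an element-wise comparison; a sketch of how such a check can be written (the helper name is mine, the thresholds mirror the numbers quoted above):

```python
import torch

def logits_close_enough(baseline: torch.Tensor, cp: torch.Tensor,
                        atol: float = 2e-2, min_fraction: float = 0.85) -> bool:
    # Fraction of logit elements whose absolute difference stays within atol.
    diff = (baseline.float() - cp.float()).abs()
    fraction_within_tol = (diff <= atol).float().mean().item()
    return fraction_within_tol >= min_fraction
```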
 ### Test 5: Loss Similarity
 - Compares averaged losses from both CP ranks
@@ -172,7 +165,6 @@ The framework uses **scientifically calibrated tolerances** based on:
 1. **Mixed Precision Effects**: bfloat16 has ~3-4 decimal digits of precision
 2. **Distributed Communication**: AllReduce operations introduce small numerical errors
 3. **Computation Order**: Different operation sequences in CP vs non-CP modes
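The bfloat16 point is easy to verify directly; a small illustration (not taken from the test code):

```python
import torch

# bfloat16 keeps only ~8 bits of mantissa precision, so representable values
# adjacent to 1.0 are ~7.8e-3 apart; a single round-trip can already shift a
# value by around 1e-3, before differing reduction orders across ranks
# compound the error.
print(torch.finfo(torch.bfloat16).eps)           # 0.0078125

x = torch.tensor(1.003, dtype=torch.float32)
print((x - x.to(torch.bfloat16).float()).abs())  # ~9e-4 rounding error
```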
 - BSHD: Verify batch dimension handling in reconstruction
-
-## 📊 Performance Characteristics
-
-- **Model Size**: ~2.1M parameters (lightweight but representative)
-- **Memory Usage**: ~50MB per GPU (enables testing on modest hardware)
-- **Runtime**: ~30 seconds per format test suite (fast iteration cycles)
-- **Scalability**: Easily extends to larger models and more GPUs
-- **Logging**: Clean output by default, verbose logs available on demand
-
-## 🔄 Recent Improvements
-
-### Core Functionality
-- **Dynamic Sequence Dimension Handling**: Automatically determines correct dimension based on tensor format (BSHD vs SBHD)
-- **Optimized Device Placement**: Index tensors now created directly on target device, eliminating CPU-GPU sync overhead
-- **Comprehensive Shape Validation**: Added pre-reshape checks with informative error messages
-- **Enhanced Error Handling**: Proper exception handling for shape mismatches and invalid operations
-
-### Testing Infrastructure
-- **File-Based Completion Markers**: Reliable synchronization between distributed processes
-- **Automatic Process Coordination**: Ensures sequential execution of multiple `torchrun` commands
-- **Configurable Timeout Protection**: Prevents hanging on failed distributed runs
-- **Clean Marker Management**: Automatic cleanup of old completion markers
-
-### Developer Experience
-- **Dual Format Support**: Added BSHD format alongside original THD format
-- **Cleaner Output**: Replaced verbose print statements with configurable Python logging
-- **Reduced Code Duplication**: Consolidated common logic between test suites
-- **Better Debugging**: Rank-aware logging in distributed mode
-- **Professional Structure**: Removed redundant comments and improved code organization
-- **Flexible Verbosity**: Toggle between clean and detailed output modes
-
-### Test Coverage
-- **New BSHD/SBHD Format Tests**: Added comprehensive tests in `test_cp_utils.py`
-- **Format-Specific Validation**: Separate test suites for THD and BSHD formats
-- **Improved Test Isolation**: Each test properly manages its own state
-
-This framework provides a **robust, automated, and scientifically sound** approach to validating context parallelism implementations in TransformerEngine for multiple attention formats.
+✅ **Scalability**: Folder extends to larger CP sizes and models