Commit 82e6105

Author: Jonathan Mitchell
Commit message: readme
Signed-off-by: Jonathan Mitchell <[email protected]>
1 parent 310a064 commit 82e6105

File tree: 2 files changed, +7 −70 lines


examples/pytorch/transformer/README.md

Lines changed: 6 additions & 69 deletions
@@ -2,18 +2,11 @@

 This directory contains a comprehensive testing framework for validating **Context Parallelism (CP)** in TransformerEngine. The framework compares single-GPU baseline runs against distributed multi-GPU runs to ensure numerical consistency and correctness.

-**Key Features:**
-- 🔄 Automatic synchronization using file-based completion markers
-- 📊 Configurable logging verbosity for cleaner output
-- 🎯 Dynamic sequence dimension handling for BSHD/SBHD formats
-- ⚡ Optimized device placement for reduced CPU-GPU overhead
-- 🛡️ Comprehensive shape validation and error handling
-
 ## 🏗️ Architecture Overview

 ### Test Models

-The framework supports two attention formats, each with its own model implementation in `model.py`:
+The framework supports two attention input formats, each with its own model implementation in `model.py`:

 **1. SimpleThDModel (THD Format - Token-Head-Dimension):**
 - Processes sequences in flattened token format
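
For orientation, the two layouts differ roughly as in the minimal sketch below; the tensor and variable names are illustrative assumptions, not code from `model.py`:

```python
import torch

# BSHD: one padded tensor of shape [batch, seq_len, num_heads, head_dim]
batch, seq_len, heads, dim = 2, 8, 4, 16
bshd = torch.randn(batch, seq_len, heads, dim)

# THD: variable-length sequences flattened to [total_tokens, num_heads, head_dim],
# with cumulative sequence lengths marking sequence boundaries (no padding).
seq_lens = torch.tensor([8, 5])
cu_seqlens = torch.cat([torch.zeros(1, dtype=torch.int64), seq_lens.cumsum(0)])  # [0, 8, 13]
thd = torch.randn(int(seq_lens.sum()), heads, dim)
```
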
@@ -76,7 +69,7 @@ The framework runs parallel test suites for both THD and BSHD formats:
 - `context_parallel_runner_thd.py` (THD format)
 - `context_parallel_runner_bshd.py` (BSHD format)

-Both runners execute on a single GPU:
+Both programs are launched with `torchrun`; in this phase no parallelism is used, so the two GPUs perform identical forward passes (one per GPU).
 1. **Model Creation**: Initialize model (SimpleThDModel or SimpleBSHDModel)
 2. **Forward Pass**: Process full sequences without parallelization
 3. **Loss Computation**: Cross-entropy on valid (non-padded) tokens
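
Step 3 above amounts to a masked cross-entropy; a minimal sketch (the pad id and tensor names are assumptions, not the runner's actual code):

```python
import torch
import torch.nn.functional as F

PAD_ID = 0                          # assumed padding token id
logits = torch.randn(2, 8, 32)      # [batch, seq, vocab]
labels = torch.randint(1, 32, (2, 8))
labels[:, 6:] = PAD_ID              # pretend the last positions are padding

# ignore_index excludes padded positions, so only valid tokens contribute to the loss.
loss = F.cross_entropy(logits.reshape(-1, 32), labels.reshape(-1), ignore_index=PAD_ID)
```
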
@@ -87,7 +80,7 @@ Both runners execute on a single GPU:
 5. **State Persistence**: Save model weights and results to `/tmp/` for CP=2 comparison

 ### Phase 2: Distributed Run (CP=2)
-**Execution**: Same runner files in distributed mode via `torchrun`
+Torch distributed is then initialized with context parallel size 2, and both GPUs participate in each forward pass.

 1. **Process Group Setup**: Initialize NCCL backend for 2 GPUs
 2. **Device Mesh**: Create `(fsdp=1, cp=2, tp=1)` parallelization strategy
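
Steps 1–2 typically reduce to something like the sketch below (assuming the usual `torchrun` environment variables; the mesh dimension names follow the text above, the rest is illustrative):

```python
import torch
import torch.distributed as dist
from torch.distributed.device_mesh import init_device_mesh

# torchrun supplies RANK / WORLD_SIZE / LOCAL_RANK; NCCL backend for the 2 GPUs.
dist.init_process_group(backend="nccl")
torch.cuda.set_device(dist.get_rank() % torch.cuda.device_count())

# (fsdp=1, cp=2, tp=1): only the context-parallel dimension is larger than 1.
mesh = init_device_mesh("cuda", (1, 2, 1), mesh_dim_names=("fsdp", "cp", "tp"))
cp_group = mesh["cp"].get_group()   # process group used for context parallelism
```
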
@@ -140,7 +133,7 @@ LOGITS_ELEMENT_TOLERANCE = 2e-2 # Individual element tolerance

 **Success Criteria**: ≥85% of logit elements must be within 2e-2 absolute difference

-**Why 85%?** Distributed computation with mixed precision (bfloat16) introduces expected numerical differences. This threshold balances strictness with practical distributed computing realities.
+**Why 85%?** Distributed computation with mixed precision (bfloat16) introduces expected numerical differences. This threshold balances strictness with practical distributed computing realities. Moreover, as the number of hidden layers (TE layers) increases, the numerical differences between the non-CP and CP runs grow.

 ### Test 5: Loss Similarity
 - Compares averaged losses from both CP ranks
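
Concretely, the ≥85% criterion is equivalent to a check along these lines (a sketch using the tolerance quoted above; the function and tensor names are placeholders):

```python
import torch

LOGITS_ELEMENT_TOLERANCE = 2e-2   # individual element tolerance from the config above
MIN_MATCH_FRACTION = 0.85

def logits_close_enough(cp1_logits: torch.Tensor, cp2_logits: torch.Tensor) -> bool:
    # Fraction of logit elements whose absolute difference is within tolerance.
    diff = (cp1_logits.float() - cp2_logits.float()).abs()
    match_fraction = (diff <= LOGITS_ELEMENT_TOLERANCE).float().mean().item()
    return match_fraction >= MIN_MATCH_FRACTION
```
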
@@ -172,7 +165,6 @@ The framework uses **scientifically calibrated tolerances** based on:
 1. **Mixed Precision Effects**: bfloat16 has ~3-4 decimal digits of precision
 2. **Distributed Communication**: AllReduce operations introduce small numerical errors
 3. **Computation Order**: Different operation sequences in CP vs non-CP modes
-4. **Hardware Variations**: GPU-to-GPU numerical differences

 **Conservative but Practical**: Tolerances are tight enough to catch real bugs while loose enough to handle expected distributed computing variations.

@@ -182,10 +174,6 @@ The framework uses **scientifically calibrated tolerances** based on:
 ```bash
 # Run both THD and BSHD format tests with automatic synchronization
 bash run_context_parallel.sh
-
-# Run individual format tests
-bash run_context_parallel_thd.sh # THD format only
-bash run_context_parallel_bshd.sh # BSHD format only
 ```

 ### Script Features
@@ -258,63 +246,12 @@ LOGITS_ELEMENT_TOLERANCE = 2e-2 # Stricter: 1e-2, Looser: 5e-2
 GRADIENT_MAX_ABSOLUTE_DIFF_TOLERANCE = 0.05 # Stricter: 0.01, Looser: 0.1
 ```

-## 🎯 What This Framework Validates
+## 🎯 What This Folder Validates

 **Numerical Correctness**: CP=2 produces equivalent results to CP=1
 **Format Compatibility**: Both THD and BSHD attention formats work correctly
 **Gradient Consistency**: Distributed training gradients match single-GPU
 **Loss Preservation**: Training objectives remain unchanged
 **Sequence Reconstruction**: Distributed chunks correctly reassemble
 **Memory Efficiency**: Context parallelism reduces per-GPU memory usage
-**Scalability**: Framework extends to larger CP sizes and models
-
-## 🔍 Debugging Failed Tests
-
-**Logits Test Failure**: Check for model weight synchronization issues or incorrect sequence chunking
-
-**Gradient Test Failure**: Investigate DDP configuration or learning rate scaling
-
-**Loss Test Failure**: Verify identical random seeds and data preprocessing
-
-**Reconstruction Failure**: Debug `get_batch_on_this_cp_rank()` slicing logic for the specific format (THD vs BSHD)
-
-**Format-Specific Issues**:
-- THD: Check cumulative sequence length calculations
-- BSHD: Verify batch dimension handling in reconstruction
-
-## 📊 Performance Characteristics
-
-- **Model Size**: ~2.1M parameters (lightweight but representative)
-- **Memory Usage**: ~50MB per GPU (enables testing on modest hardware)
-- **Runtime**: ~30 seconds per format test suite (fast iteration cycles)
-- **Scalability**: Easily extends to larger models and more GPUs
-- **Logging**: Clean output by default, verbose logs available on demand
-
-## 🔄 Recent Improvements
-
-### Core Functionality
-- **Dynamic Sequence Dimension Handling**: Automatically determines correct dimension based on tensor format (BSHD vs SBHD)
-- **Optimized Device Placement**: Index tensors now created directly on target device, eliminating CPU-GPU sync overhead
-- **Comprehensive Shape Validation**: Added pre-reshape checks with informative error messages
-- **Enhanced Error Handling**: Proper exception handling for shape mismatches and invalid operations
-
-### Testing Infrastructure
-- **File-Based Completion Markers**: Reliable synchronization between distributed processes
-- **Automatic Process Coordination**: Ensures sequential execution of multiple `torchrun` commands
-- **Configurable Timeout Protection**: Prevents hanging on failed distributed runs
-- **Clean Marker Management**: Automatic cleanup of old completion markers
-
-### Developer Experience
-- **Dual Format Support**: Added BSHD format alongside original THD format
-- **Cleaner Output**: Replaced verbose print statements with configurable Python logging
-- **Reduced Code Duplication**: Consolidated common logic between test suites
-- **Better Debugging**: Rank-aware logging in distributed mode
-- **Professional Structure**: Removed redundant comments and improved code organization
-- **Flexible Verbosity**: Toggle between clean and detailed output modes
-
-### Test Coverage
-- **New BSHD/SBHD Format Tests**: Added comprehensive tests in `test_cp_utils.py`
-- **Format-Specific Validation**: Separate test suites for THD and BSHD formats
-- **Improved Test Isolation**: Each test properly manages its own state
-
-This framework provides a **robust, automated, and scientifically sound** approach to validating context parallelism implementations in TransformerEngine for multiple attention formats.
+**Scalability**: Folder extends to larger CP sizes and models
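
The gradient-consistency check listed above presumably reduces to a per-parameter comparison against `GRADIENT_MAX_ABSOLUTE_DIFF_TOLERANCE`; a hypothetical sketch (the dict-of-gradients layout is an assumption):

```python
import torch

GRADIENT_MAX_ABSOLUTE_DIFF_TOLERANCE = 0.05   # from the config above

def gradients_match(grads_cp1: dict, grads_cp2: dict) -> bool:
    # Compare gradients parameter by parameter on the maximum absolute difference.
    for name, g1 in grads_cp1.items():
        g2 = grads_cp2[name]
        if (g1.float() - g2.float()).abs().max().item() > GRADIENT_MAX_ABSOLUTE_DIFF_TOLERANCE:
            return False
    return True
```
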

examples/pytorch/transformer/test_context_parallel_bshd.py

Lines changed: 1 addition & 1 deletion
@@ -250,7 +250,7 @@ def test_cp_indices_calculation(load_test_data):

     # Get baseline logits to determine batch size and sequence length
     cp1_logits = test_data['cp1_results']['logits']
-    batch_size, seq_len, vocab_size = cp1_logits.shape
+    batch_size, seq_len, _ = cp1_logits.shape

     # Calculate indices for CP=2
     rank_indices = calculate_cp_indices_bshd(batch_size, seq_len, cp_size=2)
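
For context, CP index calculation for BSHD typically follows the load-balanced 2·cp_size chunking scheme, where rank r owns chunks r and 2·cp_size−1−r of the sequence dimension. The sketch below is a guess at the shape of such a helper, not the actual `calculate_cp_indices_bshd` implementation:

```python
import torch

def calculate_cp_indices_bshd_sketch(batch_size: int, seq_len: int, cp_size: int = 2):
    """Hypothetical: sequence indices each CP rank owns under 2*cp_size chunking."""
    # Indices run along the sequence dimension and are shared by every batch element,
    # so batch_size is not needed for the index math itself.
    assert seq_len % (2 * cp_size) == 0
    chunk = seq_len // (2 * cp_size)
    rank_indices = []
    for rank in range(cp_size):
        first = torch.arange(rank * chunk, (rank + 1) * chunk)
        second_start = (2 * cp_size - 1 - rank) * chunk
        second = torch.arange(second_start, second_start + chunk)
        rank_indices.append(torch.cat([first, second]))
    return rank_indices  # one index tensor of length seq_len // cp_size per rank

# Example: seq_len=8, cp_size=2 -> rank 0 gets [0,1,6,7], rank 1 gets [2,3,4,5]
```
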
