hi,
I’ve encountered a performance issue during the Snapshot analysis stage within a single scheduling cycle.
To evaluate the performance of KAI in a production-like scenario, we can only conduct a simulated scheduling test rather than deploying it directly in the production cluster.
Test Method
- Generate a mocked Snapshot (close to a production cluster)
   a. Generate a Snapshot with info from a production cluster (no KAI installed)
   b. Modify the GPU pods in the snapshot to utilize KAI
   c. Add a new unscheduled workload to the snapshot
- Execute a KAI scheduling cycle on the snapshot
- Record performance metrics: Only the scheduling time was measured, with snapshot loading and parsing excluded from performance metrics.
Cluster info
- Node num: 4000+ (including 100+ GPU nodes)
- Pod num: 140000+ (including 500+ GPU pods)
Test Results
Case | Workload | ResourceRequest | Scheduling Result | Time Cost (🔴: Snapshot analysis time, 🟢: Scheduling action time) |
---|---|---|---|---|
1 | Deployment Replicas=6 | 4 H100/Pod | Replica 0-2: allocate; Replica 3-5: reclaim | 2.506s = 🔴2.216s + 🟢290ms |
2 | Deployment Replicas=6 | 4 H200/Pod | Replica 0-5: allocate | 2.351s = 🔴2.175s + 🟢176ms |
3 | Deployment Replicas=20 | 4 H200/Pod | Replica 0-18: allocate; Replica 19: reclaim | 2.430s = 🔴2.174s + 🟢256ms |
4 | Deployment Replicas=20 | 4 A100/Pod | Replica 0-9: allocate; Replica 10-19: reclaim | 2.851s = 🔴2.181s + 🟢670ms |
Observations
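I found that most of the time in a single scheduling cycle (about 2.2s in every case, roughly 76-93% of the total) was spent in the snapshot analysis stage, and that this cost stayed nearly constant regardless of the workload being scheduled.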
This stage appears to involve generating a node-to-pods mapping from the snapshot data.
KAI-Scheduler/pkg/scheduler/cache/cluster_info/cluster_info.go
Lines 134 to 138 in 77358e0
    snapshot.Pods, err = c.addTasksToNodes(allPods, existingPods, snapshot.Nodes, snapshot.BindRequests)
    if err != nil {
        err = errors.WithStack(fmt.Errorf("error adding tasks to nodes: %c", err))
        return nil, err
    }
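
To make the discussion concrete, here is a rough sketch of what building a node-to-pods mapping looks like at this cluster size. It is not the KAI implementation of `addTasksToNodes`: the `Pod` type and `groupPodsByNode` are hypothetical and stripped down, and the synthetic data in `main` only mirrors the node and pod counts listed above. It could serve as a baseline for how long a plain single-pass grouping takes on the same machine.

```go
package main

import (
	"fmt"
	"time"
)

// Pod is a hypothetical, minimal pod record; it is not a KAI type.
type Pod struct {
	Name     string
	NodeName string // empty for pods that are not scheduled yet
}

// groupPodsByNode builds a node-to-pods mapping in one pass over pods.
// Pods without a node are returned separately as pending; pods that
// reference a node missing from nodeNames are skipped.
func groupPodsByNode(pods []Pod, nodeNames []string) (map[string][]*Pod, []*Pod) {
	byNode := make(map[string][]*Pod, len(nodeNames))
	for _, n := range nodeNames {
		byNode[n] = nil // register every known node, even if it has no pods
	}
	var pending []*Pod
	for i := range pods {
		p := &pods[i]
		if p.NodeName == "" {
			pending = append(pending, p)
			continue
		}
		if _, known := byNode[p.NodeName]; !known {
			continue
		}
		byNode[p.NodeName] = append(byNode[p.NodeName], p)
	}
	return byNode, pending
}

func main() {
	// Synthetic data sized like the cluster in this issue (assumed shape).
	const numNodes, numPods = 4000, 140000
	nodes := make([]string, numNodes)
	for i := range nodes {
		nodes[i] = fmt.Sprintf("node-%d", i)
	}
	pods := make([]Pod, numPods)
	for i := range pods {
		pods[i] = Pod{Name: fmt.Sprintf("pod-%d", i), NodeName: nodes[i%numNodes]}
	}
	pods[numPods-1].NodeName = "" // one new, unscheduled pod

	start := time.Now()
	byNode, pending := groupPodsByNode(pods, nodes)
	fmt.Printf("grouped %d nodes, %d pending pods in %v\n",
		len(byNode), len(pending), time.Since(start))
}
```

If a plain grouping like this turns out to be much faster than the ~2.2s measured, the cost is more likely in the per-pod work done while building the snapshot than in the mapping itself.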