[Performance] Snapshot analysis completes in ~2 seconds on large clusters #315

@hello2mao

Hi,

I’ve encountered a performance issue in the snapshot analysis stage of a single scheduling cycle.

We cannot deploy KAI directly in the production cluster, so to evaluate its performance in a production-like scenario we ran a simulated scheduling test instead.

Test Method

  1. Generate a mocked Snapshot (close to a production cluster)
    a. Generate a Snapshot with info from a production cluster (no KAI installed)
    b. Modify the GPU pods in the snapshot to utilize KAI
    c. Add a new unscheduled workload to the snapshot
  2. Execute a KAI scheduling cycle on the snapshot
  3. Record performance metrics: only the scheduling time is measured; snapshot loading and parsing are excluded.

Cluster info

  • Node num: 4000+ (including 100+ GPU nodes)
  • Pod num: 140000+ (including 500+ GPU pods)

Test Results

Time Cost legend: 🔴 = snapshot analysis time, 🟢 = scheduling action time

| Case | Workload | ResourceRequest | Scheduling Result | Time Cost |
|------|----------|-----------------|-------------------|-----------|
| 1 | Deployment, Replicas=6 | 4 H100/Pod | Replica 0-2: allocate; Replica 3-5: reclaim | 2.506s = 🔴2.216s + 🟢290ms |
| 2 | Deployment, Replicas=6 | 4 H200/Pod | Replica 0-5: allocate | 2.351s = 🔴2.175s + 🟢176ms |
| 3 | Deployment, Replicas=20 | 4 H200/Pod | Replica 0-18: allocate; Replica 19: reclaim | 2.430s = 🔴2.174s + 🟢256ms |
| 4 | Deployment, Replicas=20 | 4 A100/Pod | Replica 0-9: allocate; Replica 10-19: reclaim | 2.851s = 🔴2.181s + 🟢670ms |

Observations

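In every case, snapshot analysis took about 2.2s, which is more than three quarters of the total time of a single scheduling cycle. The analysis time stayed nearly constant across workloads, while the scheduling action time varied from 176ms to 670ms.
This stage appears to involve generating a node-to-pods mapping from the snapshot data: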

snapshot.Pods, err = c.addTasksToNodes(allPods, existingPods, snapshot.Nodes, snapshot.BindRequests)
if err != nil {
	// %w wraps the underlying error; the original %c verb formats a rune, not an error.
	err = errors.WithStack(fmt.Errorf("error adding tasks to nodes: %w", err))
	return nil, err
}
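
For readers unfamiliar with this kind of step, a node-to-pods mapping can be built in a single pass over the pod list, as sketched below. This is an illustration of the general technique only and makes no claim about how `addTasksToNodes` is implemented in KAI; the `Pod` type and `groupPodsByNode` are hypothetical names.

```go
package main

import "fmt"

// Pod is a hypothetical, minimal pod record with only the fields needed here.
type Pod struct {
	Name     string
	NodeName string // empty when the pod is not yet scheduled
}

// groupPodsByNode builds the node-to-pods mapping in one pass over pods.
// Pods without a node are returned separately as pending.
func groupPodsByNode(pods []Pod) (byNode map[string][]Pod, pending []Pod) {
	byNode = make(map[string][]Pod)
	for _, p := range pods {
		if p.NodeName == "" {
			pending = append(pending, p)
			continue
		}
		byNode[p.NodeName] = append(byNode[p.NodeName], p)
	}
	return byNode, pending
}

func main() {
	pods := []Pod{
		{Name: "web-0", NodeName: "node-a"},
		{Name: "train-0", NodeName: "node-b"},
		{Name: "web-1", NodeName: "node-a"},
		{Name: "train-1"}, // unscheduled
	}
	byNode, pending := groupPodsByNode(pods)
	fmt.Println(len(byNode["node-a"]), len(byNode["node-b"]), len(pending)) // prints: 2 1 1
}
```

The work here grows with the number of pods rather than with nodes times pods, which matters at the scale described above (140000+ pods on 4000+ nodes).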
