[SYCL][PI][L0] Add dynamic batch size adjustment #2792
Merged
Changes from all commits (16 commits):

25ce7a7  Merge pull request #1 from intel/sycl  (kbsmith-intel)
9f0b68f  Merge pull request #2 from intel/sycl  (kbsmith-intel)
408da21  Merge pull request #3 from intel/sycl  (kbsmith-intel)
455cabe  Merge pull request #4 from intel/sycl  (kbsmith-intel)
c6ce758  Merge pull request #5 from intel/sycl  (kbsmith-intel)
ff97af7  Merge pull request #6 from intel/sycl  (kbsmith-intel)
601e9a4  Merge pull request #7 from intel/sycl  (kbsmith-intel)
187f98a  Add dynamic batch size adjustment  (kbsmith-intel)
c0f0e1e  Fixes mispelling of dynamic in documentation.  (kbsmith-intel)
348ad62  Fixes code formating error in dynamic batching changes  (kbsmith-intel)
3b6ab9f  Changes level zero batch test to turn off dynamic batching.  (kbsmith-intel)
cfcc95c  Adds a new unit test for dynamic batch size feature  (kbsmith-intel)
e983a63  Removes extra end-of-line whitespace in new test source.  (kbsmith-intel)
3612fed  Fix typos in the new test level_zero_dynamic_batch_test.cpp  (kbsmith-intel)
65b7087  Updates dynamic batching to address review comments.  (kbsmith-intel)
0df64e2  Adjust dynamic batching maximum batch size and start size.  (kbsmith-intel)
New test file: level_zero_dynamic_batch_test.cpp
// REQUIRES: gpu, level_zero

// RUN: %clangxx -fsycl -fsycl-targets=%sycl_triple %s -o %t.out

// Check that dynamic batching raises and lowers the batch size as expected.
// RUN: env SYCL_PI_TRACE=2 ZE_DEBUG=1 %GPU_RUN_PLACEHOLDER %t.out 2>&1 | FileCheck --check-prefixes=CKALL,CKDYN %s

// level_zero_dynamic_batch_test.cpp
//
// This tests the Level Zero plugin's dynamic kernel batch size adjustment
// code.
// It starts out by enqueuing 40 kernels before it does a wait, and it does
// this 5 times. That should cause the dynamic batch size adjustment to
// raise the batch size several times.
//
// Then the test starts enqueuing only 4 kernels before doing a wait, and
// it does that 5 times as well. That should cause the batch size to
// be lowered, just once, to be less than 4.
//
// CKDYN: Raising QueueBatchSize to 3
// CKDYN: Raising QueueBatchSize to 4
// CKDYN-NOT: Raising QueueBatchSize
// CKALL: Test Pass
// CKALL: Test Pass
// CKALL: Test Pass
// CKALL: Test Pass
// CKALL: Test Pass
// CKALL: Test Pass
// CKALL: Test Pass
// CKDYN: Lowering QueueBatchSize to 3
// CKDYN-NOT: Lowering QueueBatchSize
// CKALL: Test Pass
// CKALL: Test Pass
// CKALL: Test Pass
// CKALL: Test Pass

#include "CL/sycl.hpp" | ||
#include <chrono> | ||
#include <cmath> | ||
#include <iostream> | ||
|
||
namespace sycl = cl::sycl; | ||
|
||
void validate(uint32_t *result, uint32_t *expect, size_t n) { | ||
int error = 0; | ||
for (int i = 0; i < n; i++) { | ||
if (result[i] != expect[i]) { | ||
error++; | ||
if (error < 10) { | ||
printf("Error: %d, expect: %d\n", result[i], expect[i]); | ||
} | ||
} | ||
} | ||
error > 0 ? printf("Error: %d\n", error) : printf("Test Pass\n"); | ||
} | ||
|
||
int main(int argc, char *argv[]) {
  size_t M = 65536;
  size_t N = 512 / 4;
  size_t AL = M * N * sizeof(uint32_t);

  sycl::queue q(sycl::default_selector{});
  auto ctx = q.get_context();
  auto dev = q.get_device();

  uint32_t *Y1 = static_cast<uint32_t *>(sycl::malloc_shared(AL, dev, ctx));
  uint32_t *Z1 = static_cast<uint32_t *>(sycl::malloc_shared(AL, dev, ctx));
  uint32_t *Z2 = static_cast<uint32_t *>(sycl::malloc_shared(AL, dev, ctx));
  uint32_t *Z3 = static_cast<uint32_t *>(sycl::malloc_shared(AL, dev, ctx));
  uint32_t *Z4 = static_cast<uint32_t *>(sycl::malloc_shared(AL, dev, ctx));
  uint32_t *Z5 = static_cast<uint32_t *>(sycl::malloc_shared(AL, dev, ctx));
  uint32_t *Z6 = static_cast<uint32_t *>(sycl::malloc_shared(AL, dev, ctx));
  uint32_t *Z7 = static_cast<uint32_t *>(sycl::malloc_shared(AL, dev, ctx));
  uint32_t *Z8 = static_cast<uint32_t *>(sycl::malloc_shared(AL, dev, ctx));

  for (size_t i = 0; i < M * N; i++) {
    Y1[i] = i % 255;
  }

  memset(Z1, '\0', AL);
  memset(Z2, '\0', AL);
  memset(Z3, '\0', AL);
  memset(Z4, '\0', AL);
  memset(Z5, '\0', AL);
  memset(Z6, '\0', AL);
  memset(Z7, '\0', AL);
  memset(Z8, '\0', AL);

  for (size_t i = 0; i < 5; i++) {
    for (size_t j = 0; j < 5; j++) {
      q.submit([&](sycl::handler &h) {
        h.parallel_for<class u32_copy1>(sycl::range<2>{M, N},
                                        [=](sycl::id<2> it) {
                                          const int m = it[0];
                                          const int n = it[1];
                                          Z1[m * N + n] = Y1[m * N + n];
                                        });
      });
      q.submit([&](sycl::handler &h) {
        h.parallel_for<class u32_copy2>(sycl::range<2>{M, N},
                                        [=](sycl::id<2> it) {
                                          const int m = it[0];
                                          const int n = it[1];
                                          Z2[m * N + n] = Y1[m * N + n];
                                        });
      });
      q.submit([&](sycl::handler &h) {
        h.parallel_for<class u32_copy3>(sycl::range<2>{M, N},
                                        [=](sycl::id<2> it) {
                                          const int m = it[0];
                                          const int n = it[1];
                                          Z3[m * N + n] = Y1[m * N + n];
                                        });
      });
      q.submit([&](sycl::handler &h) {
        h.parallel_for<class u32_copy4>(sycl::range<2>{M, N},
                                        [=](sycl::id<2> it) {
                                          const int m = it[0];
                                          const int n = it[1];
                                          Z4[m * N + n] = Y1[m * N + n];
                                        });
      });
      q.submit([&](sycl::handler &h) {
        h.parallel_for<class u32_copy5>(sycl::range<2>{M, N},
                                        [=](sycl::id<2> it) {
                                          const int m = it[0];
                                          const int n = it[1];
                                          Z5[m * N + n] = Y1[m * N + n];
                                        });
      });
      q.submit([&](sycl::handler &h) {
        h.parallel_for<class u32_copy6>(sycl::range<2>{M, N},
                                        [=](sycl::id<2> it) {
                                          const int m = it[0];
                                          const int n = it[1];
                                          Z6[m * N + n] = Y1[m * N + n];
                                        });
      });
      q.submit([&](sycl::handler &h) {
        h.parallel_for<class u32_copy7>(sycl::range<2>{M, N},
                                        [=](sycl::id<2> it) {
                                          const int m = it[0];
                                          const int n = it[1];
                                          Z7[m * N + n] = Y1[m * N + n];
                                        });
      });
      q.submit([&](sycl::handler &h) {
        h.parallel_for<class u32_copy8>(sycl::range<2>{M, N},
                                        [=](sycl::id<2> it) {
                                          const int m = it[0];
                                          const int n = it[1];
                                          Z8[m * N + n] = Y1[m * N + n];
                                        });
      });
    }
    q.wait();
  }

  validate(Y1, Z1, M * N);
  validate(Y1, Z2, M * N);
  validate(Y1, Z3, M * N);
  validate(Y1, Z4, M * N);
  validate(Y1, Z5, M * N);
  validate(Y1, Z6, M * N);
  validate(Y1, Z7, M * N);
  validate(Y1, Z8, M * N);

  for (size_t i = 0; i < 5; i++) {
    q.submit([&](sycl::handler &h) {
      h.parallel_for<class u32_copy9>(sycl::range<2>{M, N},
                                      [=](sycl::id<2> it) {
                                        const int m = it[0];
                                        const int n = it[1];
                                        Z1[m * N + n] = Y1[m * N + n];
                                      });
    });
    q.submit([&](sycl::handler &h) {
      h.parallel_for<class u32_copy10>(sycl::range<2>{M, N},
                                       [=](sycl::id<2> it) {
                                         const int m = it[0];
                                         const int n = it[1];
                                         Z2[m * N + n] = Y1[m * N + n];
                                       });
    });
    q.submit([&](sycl::handler &h) {
      h.parallel_for<class u32_copy11>(sycl::range<2>{M, N},
                                       [=](sycl::id<2> it) {
                                         const int m = it[0];
                                         const int n = it[1];
                                         Z3[m * N + n] = Y1[m * N + n];
                                       });
    });
    q.submit([&](sycl::handler &h) {
      h.parallel_for<class u32_copy12>(sycl::range<2>{M, N},
                                       [=](sycl::id<2> it) {
                                         const int m = it[0];
                                         const int n = it[1];
                                         Z4[m * N + n] = Y1[m * N + n];
                                       });
    });
    q.wait();
  }
  validate(Y1, Z1, M * N);
  validate(Y1, Z2, M * N);
  validate(Y1, Z3, M * N);
  validate(Y1, Z4, M * N);

  sycl::free(Y1, ctx);
  sycl::free(Z1, ctx);
  sycl::free(Z2, ctx);
  sycl::free(Z3, ctx);
  sycl::free(Z4, ctx);
  sycl::free(Z5, ctx);
  sycl::free(Z6, ctx);
  sycl::free(Z7, ctx);
  sycl::free(Z8, ctx);

  return 0;
}
kbsmith-intel marked this conversation as resolved.
NIT: I think that in today's code QueueBatchSize can't be equal to 0.
That is incorrect. In the queue itself, a QueueBatchSize of 0 means no batching. I need to allow QueueBatchSize to be 1 when dynamic batching is in use, because dynamic batching can adjust QueueBatchSize down to as low as 1 if we are still doing too many partial batches. So, in the queue itself, QueueBatchSize == 0 means no batching, QueueBatchSize > 0 is simply the current batching size, and the queue's bool UseDynamicBatching controls whether dynamic batch adjustment ever changes QueueBatchSize up or down.
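To make those semantics concrete, here is a minimal sketch; only the QueueBatchSize and UseDynamicBatching fields come from the discussion above, while the struct name, the method names, and the exact adjustment policy are illustrative assumptions rather than the plugin's actual code:

#include <cstdint>

// Illustrative sketch only, not the real _pi_queue: QueueBatchSize and
// UseDynamicBatching are the fields discussed above; everything else is
// hypothetical.
struct BatchingSketch {
  uint32_t QueueBatchSize = 0;     // 0 = no batching, N > 0 = current batch size
  bool UseDynamicBatching = false; // whether the size may be adjusted at runtime

  // Decide whether to close and submit the open command list after a new
  // command has been appended to it.
  bool shouldSubmitBatch(uint32_t OpenCommandCount) const {
    if (QueueBatchSize == 0) // batching disabled: submit every command
      return true;
    return OpenCommandCount >= QueueBatchSize;
  }

  // A wait forced submission of a partially filled batch: if that keeps
  // happening, shrink the batch size, possibly all the way down to 1.
  void onPartialBatchSubmitted() {
    if (UseDynamicBatching && QueueBatchSize > 1)
      --QueueBatchSize;
  }

  // A batch filled up completely before any wait: allow batches to grow.
  void onFullBatchSubmitted() {
    if (UseDynamicBatching)
      ++QueueBatchSize;
  }
};

Under this reading, the "Raising QueueBatchSize ..." and "Lowering QueueBatchSize ..." trace messages that the new test checks for correspond to the two adjustment paths.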
I realized this, yeah. But the code (in the _pi_queue constructor) seems never to set it to 0.
On looking at the code in pi_level_zero.hpp, I see that you are correct. Batching turned off is now really represented as QueueBatchSize == 1 and UseDynamicBatching == false; to say it another way, batching is turned off by using a fixed batch size of 1. Thank you for pointing that out.
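A hypothetical constructor sketch of that resolved representation follows; the struct name, the parameter, and the dynamic start size are assumptions for illustration, not the real pi_level_zero.hpp interface:

#include <cstdint>

// Hypothetical sketch: "batching off" is represented as a fixed batch size of
// 1 with dynamic adjustment disabled, rather than as QueueBatchSize == 0.
struct QueueBatchSetupSketch {
  uint32_t QueueBatchSize;
  bool UseDynamicBatching;

  // RequestedBatchSize > 0 asks for a fixed batch size (1 disables batching);
  // 0 asks for dynamic batching starting from some small size.
  explicit QueueBatchSetupSketch(uint32_t RequestedBatchSize) {
    if (RequestedBatchSize > 0) {
      QueueBatchSize = RequestedBatchSize; // fixed, never adjusted
      UseDynamicBatching = false;
    } else {
      QueueBatchSize = 4; // illustrative start size; adjusted up or down later
      UseDynamicBatching = true;
    }
  }
};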