forked from pytorch/pytorch
Enable maxpool_2d in NNC #3
Open
Guobing-Chen wants to merge 453 commits into master from nnc_quant_op_maxpool2d
Conversation
jgong5 reviewed on Jul 19, 2022
jgong5 approved these changes on Jul 19, 2022
Branch force-pushed from 29c55d7 to e917fd4 (compare)
This reduces boilerplate. Also, I plan to add a template parameter to ConvParams; without moving the methods onto the struct, I would have to manually template every method. Signed-off-by: Edward Z. Yang <[email protected]> Pull Request resolved: pytorch#89062 Approved by: https://github.com/SherlockNoMad
…h#89063) Signed-off-by: Edward Z. Yang <[email protected]> Pull Request resolved: pytorch#89063 Approved by: https://github.com/SherlockNoMad
…tor (pytorch#88859)" This reverts commit d60abe4. Reverted pytorch#88859 on behalf of https://github.com/kit1980 due to Broke Mac OS testing, which were clearly shown in CI
Now that periodic jobs are run under `mem_leak_check` mode with parallelization turned off, it's very easy for `linux-bionic-cuda11.6-py3-gcc7-slow-gradcheck / test` to time out because one of the shards is very close to the 4h mark.
* https://hud.pytorch.org/pytorch/pytorch/commit/2452e3f99a072760fc46d3f9025aaa37ca7ea2ab
* https://hud.pytorch.org/pytorch/pytorch/commit/35e668b5ced25e735b6e523d557ed7fd60267914

Pull Request resolved: pytorch#89079 Approved by: https://github.com/clee2000
…ytorch#89066) This adds a unit test following the FSDP change in pytorch#88781. Pull Request resolved: pytorch#89066 Approved by: https://github.com/fegin
… call (pytorch#89029) # Summary Creates a callable native function that can determine which implementation of scaled dot product will get called. This allows us to reorder the runtime dispatch of SDP to enable autograd. Pull Request resolved: pytorch#89029 Approved by: https://github.com/cpuhrsch
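A hedged sketch of what such a runtime selector could look like (illustrative Python only; the PR adds a native function, and the backend names here are assumptions):

```python
import torch

def choose_sdp_backend(q: torch.Tensor, k: torch.Tensor, v: torch.Tensor) -> str:
    # Prefer fused kernels when the inputs qualify, otherwise fall back to the
    # composite math implementation. Backend names are illustrative.
    if q.is_cuda and q.dtype in (torch.float16, torch.bfloat16):
        return "flash_attention"
    if q.is_cuda:
        return "mem_efficient"
    return "math"
```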
Pull Request resolved: pytorch#88956 Approved by: https://github.com/ezyang
Fixes pytorch#81254. This only makes the code easier to understand; it is not a real fix. Pull Request resolved: pytorch#81396 Approved by: https://github.com/fritzo, https://github.com/kit1980
…85624) When building products using PyTorch, it is often required to display license terms for all dependencies. The feature itself has been implemented in pytorch#81500 but it seems there are no options to enable it. This PR implements the option. cc/ @mattip @rgommers Pull Request resolved: pytorch#85624 Approved by: https://github.com/rgommers, https://github.com/seemethere
Summary: Support shape padding for aten.mm in Inductor (originally from [pytorch#88709](pytorch#88709)) Differential Revision: D41315078 Pull Request resolved: pytorch#89086 Approved by: https://github.com/jianyuh
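For intuition, a minimal sketch of shape padding for a GEMM (not Inductor's actual pass; `multiple` and the helper name are assumptions): zero-padding M/K/N leaves the result unchanged on the original region, since the padding contributes only zeros.

```python
import torch
import torch.nn.functional as F

def padded_mm(a: torch.Tensor, b: torch.Tensor, multiple: int = 8) -> torch.Tensor:
    # Pad each GEMM dimension up to a multiple so the kernel sees aligned,
    # tensor-core-friendly shapes, then slice the result back.
    m, k = a.shape
    _, n = b.shape
    pad = lambda d: (multiple - d % multiple) % multiple
    a_p = F.pad(a, (0, pad(k), 0, pad(m)))  # pad columns (k), then rows (m)
    b_p = F.pad(b, (0, pad(n), 0, pad(k)))
    return torch.mm(a_p, b_p)[:m, :n]
```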
Inductor test report artifacts are now on HUD, but their files are in CSV format instead of the default XML files from pytest or unittest that we expect, so this PR uploads both suffixes. Pull Request resolved: pytorch#89112 Approved by: https://github.com/desertfire
…ytorch#88549) This PR creates the `torch.distributed._tensor` package and moves DeviceMesh and PlacementTypes into it. Part of pytorch#88838. Pull Request resolved: pytorch#88549 Approved by: https://github.com/fduwjj
…ed (pytorch#88176) This PR moves the core DTensor abstraction and high-level APIs to the torch.distributed._tensor folder, including: 1. the DTensor class, 2. high-level APIs (distribute_tensor/module), 3. dispatching logic, 4. redistribute logic. Part of pytorch#88838. Pull Request resolved: pytorch#88176 Approved by: https://github.com/fduwjj
…88177) This PR moves most DTensor ops to torch.distributed._tensor. We will add all tests in the following PRs. Part of pytorch#88838. Pull Request resolved: pytorch#88177 Approved by: https://github.com/fduwjj
…orch#88550) This PR moves the view-related DTensor ops to core distributed; tests will be added in follow-up PRs. Part of pytorch#88838. Pull Request resolved: pytorch#88550 Approved by: https://github.com/fduwjj
…ch#88178) This PR moves DTensor basic tests to torch.distributed, including the dtensor and device_mesh tests. Part of pytorch#88838. Pull Request resolved: pytorch#88178 Approved by: https://github.com/fduwjj
…88551) This PR moves DTensor op tests to core distributed, including prop_rule, pointwise op, and matrix op tests, etc. Part of pytorch#88838. Pull Request resolved: pytorch#88551 Approved by: https://github.com/aazzolini
…ytorch#88179) This PR moves the remaining tests, i.e. tensor_ops and op db tests, to core distributed. Part of pytorch#88838. Pull Request resolved: pytorch#88179 Approved by: https://github.com/aazzolini
…ted (pytorch#88180) This PR moves the tensor/parallel folder and its tests to torch.distributed. Part of pytorch#88838. Pull Request resolved: pytorch#88180 Approved by: https://github.com/aazzolini
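Taken together, the moves above assemble the torch.distributed._tensor package. A minimal usage sketch (public names as described in these commits; exact signatures may vary across versions):

```python
import torch
from torch.distributed._tensor import DeviceMesh, Shard, distribute_tensor

# Assumes torch.distributed is already initialized with 4 ranks.
mesh = DeviceMesh("cuda", list(range(4)))      # 1-D device mesh over 4 GPUs
t = torch.randn(8, 8)
dt = distribute_tensor(t, mesh, [Shard(0)])    # shard dim 0 across the mesh
print(dt.to_local().shape)                     # each rank holds a (2, 8) shard
```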
Pull Request resolved: pytorch#89095 Approved by: https://github.com/yanboliang, https://github.com/mlazos
…orch#88246)" This reverts commit 62ba15e. Reverted pytorch#88246 on behalf of https://github.com/DanilBaibak due to breaking internal builds
…orch#89118) Summary: Build fx-based linear/matmul/bmm + permute/transpose vertical fusion in Inductor. For an internal Ads model: **1.15x -> 1.36x speedup**. Test Plan: CI. Reviewed By: bertmaher, jansel, jianyuh Differential Revision: D41071665 Pull Request resolved: pytorch#89118 Approved by: https://github.com/jianyuh
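As a toy illustration of this kind of fx pattern rewrite (a hypothetical pattern; the actual pass in this diff is internal and more general), a trailing transpose can be folded into a batched matmul via (AB)^T = B^T A^T:

```python
import torch
import torch.fx as fx
from torch.fx import subgraph_rewriter

def pattern(x, y):
    return torch.matmul(x, y).transpose(1, 2)

def replacement(x, y):
    # (A @ B)^T == B^T @ A^T, batched over dim 0, so the explicit
    # transpose node on the output disappears.
    return torch.matmul(y.transpose(1, 2), x.transpose(1, 2))

def fuse_permute_matmul(gm: fx.GraphModule) -> fx.GraphModule:
    subgraph_rewriter.replace_pattern(gm, pattern, replacement)
    gm.recompile()
    return gm
```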
TODO: add an OpInfo Pull Request resolved: pytorch#88745 Approved by: https://github.com/ezyang
…ed-tests mode (pytorch#89454) When looking into Rockset data for a disabled unittest, for example `testAdd`, I see that it's re-run only 3 times instead of 50+ times as expected under rerun-disabled-tests mode:

```
[
  {
    "name": "testAdd",
    "classname": "TestLazyReuseIr",
    "filename": "lazy/test_reuse_ir.py",
    "flaky": false,
    "num_green": 3,
    "num_red": 0
  }
]
```

It turns out that I made a mistake mixing `RERUN_DISABLED_TESTS` and `report_only` into `(RERUN_DISABLED_TESTS or report_only) and num_retries_left < MAX_NUM_RETRIES` in pytorch#88646. The retry logic for successful tests under rerun-disabled-tests mode is never executed, because `num_retries_left` would be equal to `MAX_NUM_RETRIES` (not smaller) if the very first run succeeds. Thus, the sample test `testAdd` finishes right away (1 success count).

* `report_only` and `RERUN_DISABLED_TESTS` are 2 different things and shouldn't be mixed together. `RERUN_DISABLED_TESTS` has the higher priority.
* We also don't want to retry skipped tests under rerun-disabled-tests mode, because they are only skipped due to the `check_if_enable` check: `Test is enabled but --rerun-disabled-tests verification mode is set, so only disabled tests are run`.

### Testing

* CI: https://github.com/pytorch/pytorch/actions/runs/3518228784 generates https://gha-artifacts.s3.amazonaws.com/pytorch/pytorch/3518228784/1/artifact/test-reports-test-default-4-4-linux.4xlarge.nvidia.gpu_9627285587.zip, in which `testAdd` is correctly called multiple times and `TestLazyReuseIr` is skipped correctly.
* Locally:

```
# export CI=1
# export PYTORCH_RETRY_TEST_CASES=1
# export PYTORCH_OVERRIDE_FLAKY_SIGNAL=1
# export PYTORCH_TEST_RERUN_DISABLED_TESTS=1
$ python test/run_test.py --verbose -i lazy/test_reuse_ir
Ignoring disabled issues: []
Selected tests: lazy/test_reuse_ir
Prioritized test from test file changes.
reordering tests for PR:
prioritized: []
the rest: ['lazy/test_reuse_ir']
Downloading https://raw.githubusercontent.com/pytorch/test-infra/generated-stats/stats/slow-tests.json to /Users/huydo/Storage/mine/pytorch/test/.pytorch-slow-tests.json
Downloading https://raw.githubusercontent.com/pytorch/test-infra/generated-stats/stats/disabled-tests-condensed.json to /Users/huydo/Storage/mine/pytorch/test/.pytorch-disabled-tests.json
parallel (file granularity) tests: lazy/test_reuse_ir
serial (file granularity) tests:
Ignoring disabled issues: []
Ignoring disabled issues: []
Running lazy/test_reuse_ir ... [2022-11-21 13:21:07.165877]
Executing ['/Users/huydo/miniconda3/envs/py3.9/bin/python', '-bb', 'lazy/test_reuse_ir.py', '-v', '--import-slow-tests', '--import-disabled-tests', '--rerun-disabled-tests'] ... [2022-11-21 13:21:07.166279]
Expand the folded group to see the log file of lazy/test_reuse_ir
##[group]PRINTING LOG FILE of lazy/test_reuse_ir (/Users/huydo/Storage/mine/pytorch/test/test-reports/lazy-test_reuse_ir_6cf_dxa1)
Running tests...
----------------------------------------------------------------------
Test results will be stored in test-reports/python-unittest/lazy.test_reuse_ir
testAdd (__main__.TestLazyReuseIr) ... ok (1.215s)
testAdd (__main__.TestLazyReuseIr) ... testAdd succeeded - num_retries_left: 50 ok (0.001s)
testAdd (__main__.TestLazyReuseIr) ... testAdd succeeded - num_retries_left: 49 ok (0.001s)
[... identical re-runs counting num_retries_left down to 2 ...]
testAdd (__main__.TestLazyReuseIr) ... testAdd succeeded - num_retries_left: 1 ok (0.001s)
testAddSub (__main__.TestLazyReuseIr) ... testAdd succeeded - num_retries_left: 0 skip: Test is enabled but --rerun-disabled-tests verification mode is set, so only disabled tests are run (0.001s)
testAddSubFallback (__main__.TestLazyReuseIr) ... skip: Test is enabled but --rerun-disabled-tests verification mode is set, so only disabled tests are run (0.001s)
testBatchNorm (__main__.TestLazyReuseIr) ... skip: Test is enabled but --rerun-disabled-tests verification mode is set, so only disabled tests are run (0.001s)
----------------------------------------------------------------------
Ran 54 tests in 1.264s
OK (skipped=3)
```

Here is the sample Rockset query:

```
WITH added_row_number AS (
  SELECT
    *,
    ROW_NUMBER() OVER(PARTITION BY name, classname, filename ORDER BY _event_time DESC) AS row_number
  FROM
    commons.rerun_disabled_tests
)
SELECT
  name,
  classname,
  filename,
  flaky,
  num_green,
  num_red
FROM
  added_row_number
WHERE
  row_number = 1
  AND name = 'testAdd'
```

Pull Request resolved: pytorch#89454 Approved by: https://github.com/clee2000
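A condensed sketch of the corrected retry decision described above (a sketch only; names follow the PR text, and the real logic lives in the test harness):

```python
MAX_NUM_RETRIES = 50

def should_rerun(success: bool, skipped: bool,
                 rerun_disabled_tests: bool, num_retries_left: int) -> bool:
    """Illustrative only: RERUN_DISABLED_TESTS takes priority and must not be
    gated on num_retries_left < MAX_NUM_RETRIES, or a first-try success
    (where num_retries_left == MAX_NUM_RETRIES) would end the loop at once."""
    if rerun_disabled_tests:
        # Re-run even successful tests to gather 50+ signals, but never retry
        # tests that check_if_enable already skipped.
        return not skipped and num_retries_left > 0
    # Normal mode: retry only on failure.
    return not success and num_retries_left > 0
```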
… core distributed (pytorch#89399) This PR moves dedup_tensors and its test to torch.distributed.checkpoint. This is a pre-req for enabling 2D checkpoint. It removes duplicated shards in a list of SavePlans and is used when saving a DTensor with replicated placement. Docstrings and comments will be added in the following PRs. Pull Request resolved: pytorch#89399 Approved by: https://github.com/wanchaol
…88904) In pytorch#87741 we added inference support for the dynamo/torchxla integration. Later, in pytorch#88449, we attempted to add training support. That attempt did not go smoothly because we tried two things together:
1. let dynamo trace the model on xla rather than eager, and
2. enable training.

It turns out that neither task is trivial, and item 2 (enable training) depends on item 1 (tracing on xla). We enable training via AOTAutograd, which lifts all model parameters/buffers as graph inputs. Without item 1 done, we would need to copy all graph inputs (including model parameters/buffers) from the eager device to xla devices, which hurts performance a lot. Keeping a cache that maps eager parameters to XLA parameters does not solve the problem, since an update to either side will not sync automatically to the other; they easily go out of sync.

This PR lets dynamo trace the model on XLA rather than eager. This is a preparation step toward enabling training. Tracing on XLA also makes data movement more efficient: we see a 1.5x geomean speedup compared to the previous 1.38x.

```
+-------------------------+--------------------+-------------------------+
| Model                   | XLA (trace once)   | XLA (trace everytime)   |
+=========================+====================+=========================+
| resnet18                | 1.38               | 1.008                   |
+-------------------------+--------------------+-------------------------+
| resnet50                | 1.227              | 0.998                   |
+-------------------------+--------------------+-------------------------+
| resnext50_32x4d         | 1.544              | 1.008                   |
+-------------------------+--------------------+-------------------------+
| alexnet                 | 1.085              | 1.045                   |
+-------------------------+--------------------+-------------------------+
| mobilenet_v2            | 2.028              | 1.013                   |
+-------------------------+--------------------+-------------------------+
| mnasnet1_0              | 1.516              | 0.995                   |
+-------------------------+--------------------+-------------------------+
| squeezenet1_1           | 0.868              | 1.01                    |
+-------------------------+--------------------+-------------------------+
| vgg16                   | 1.099              | 1.008                   |
+-------------------------+--------------------+-------------------------+
| BERT_pytorch            | 3.26               | 1.027                   |
+-------------------------+--------------------+-------------------------+
| timm_vision_transformer | 2.182              | 1.015                   |
+-------------------------+--------------------+-------------------------+
| geomean                 | 1.50389            | 1.01261                 |
+-------------------------+--------------------+-------------------------+
```

Example command:

```
GPU_NUM_DEVICES=1 python benchmarks/dynamo/torchbench.py --randomize-input --performance --trace-on-xla --only resnet18 --backend=torchxla_trace_once
```

Pull Request resolved: pytorch#88904 Approved by: https://github.com/wconstab, https://github.com/JackCaoG, https://github.com/jansel
) Pull Request resolved: pytorch#89274 Approved by: https://github.com/jgong5, https://github.com/jansel
Reverts updates that were introduced by pytorch#89157 Pull Request resolved: pytorch#89449 Approved by: https://github.com/kit1980, https://github.com/huydhn, https://github.com/clee2000
…rch#89463) Summary: This permute copy change seems to be causing huge regressions on machines without AVX512. Revert to mitigate. This shouldn't be problematic since the improvement from changing it was super small anyways. Differential Revision: D41450088 Pull Request resolved: pytorch#89463 Approved by: https://github.com/hlu1
… distributed (pytorch#89398) This PR moves traverse and its test to torch.distributed.checkpoint. This is a pre-req for enabling 2D checkpoint. It is used when flattening nested dicts and flattening sharded tensors. Docstrings and comments will be added in the following PRs. Test:
```
python3 test/distributed/_tensor/parallel/test_2d_parallel.py
```
and CI. Pull Request resolved: pytorch#89398 Approved by: https://github.com/wanchaol
…rm (pytorch#81761) (pytorch#84624) Reland pytorch#81761 Differential Revision: [D39332292](https://our.internmc.facebook.com/intern/diff/D39332292) Pull Request resolved: pytorch#84624 Approved by: https://github.com/kit1980
Pull Request resolved: pytorch#88990 Approved by: https://github.com/lezcano, https://github.com/peterbell10
Summary: Fix rounding issue in quantized shaders.
Test Plan:
On Mac
```
cd ~/fbsource
buck1 run -c pt.vulkan_full_precision=1 //xplat/caffe2:pt_vulkan_quantized_api_test_binAppleMac\#macosx-arm64
```
On Android
```
cd ~/fbsource
buck1 build -c ndk.custom_libcxx=false -c pt.enable_qpl=0 -c pt.vulkan_full_precision=1 //xplat/caffe2:pt_vulkan_quantized_api_test_binAndroid\#android-arm64 --show-output
adb push buck-out/gen/xplat/caffe2/pt_vulkan_quantized_api_test_binAndroid\#android-arm64 /data/local/tmp/vulkan_quantized_api_test
adb shell "/data/local/tmp/vulkan_quantized_api_test"
```
Reviewed By: salilsdesai Differential Revision: D41047095 Pull Request resolved: pytorch#89456 Approved by: https://github.com/kirklandsign, https://github.com/digantdesai
Fixes https://github.com/pytorch/torchdynamo/issues/1888 Signed-off-by: Edward Z. Yang <[email protected]> Differential Revision: [D41460986](https://our.internmc.facebook.com/intern/diff/D41460986) Pull Request resolved: pytorch#89464 Approved by: https://github.com/bdhirsh
pytorch#89317) Differential Revision: [D41415321](https://our.internmc.facebook.com/intern/diff/D41415321) Pull Request resolved: pytorch#89317 Approved by: https://github.com/kwen2501
…ch#89318) Differential Revision: [D41415324](https://our.internmc.facebook.com/intern/diff/D41415324) Pull Request resolved: pytorch#89318 Approved by: https://github.com/kwen2501
Summary: Fixes pytorch/torchdynamo#1797 Pull Request resolved: pytorch#89289 Approved by: https://github.com/ngimel, https://github.com/jansel, https://github.com/jgong5
…c quantization (pytorch#89248) Summary: split the is_decomposed logic for `_replace_observer_with_quantize_dequantize_node` into a separate function and added support for dynamic quantization in the decomposed version of this function. In the case of dynamic quantization, we'll produce the following reference quantized pattern in decomposed mode:
```
x -> choose_qparams -> quantize_per_tensor -> dequantize_per_tensor -> linear
```
Test Plan:
```
python test/test_quantization.py -k test__convert_to_reference_decomposed_fx_dynamic_quant
```
Pull Request resolved: pytorch#89248 Approved by: https://github.com/vkuzo
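For intuition, a numerics-level sketch of that decomposed pattern using plain tensor ops (the qparams math here is a simplified assumption, not the actual quantized_decomposed ops):

```python
import torch
import torch.nn.functional as F

def choose_qparams(x: torch.Tensor, qmin: int = -128, qmax: int = 127):
    # Simplified dynamic qparams: derive scale/zero_point from the observed
    # min/max range of this particular input.
    scale = (x.max() - x.min()).clamp(min=1e-5) / float(qmax - qmin)
    zero_point = int(qmin - torch.round(x.min() / scale))
    return scale, max(qmin, min(qmax, zero_point))

def reference_dynamic_linear(x, weight, bias):
    scale, zp = choose_qparams(x)                             # choose_qparams
    q = torch.clamp(torch.round(x / scale) + zp, -128, 127)   # quantize_per_tensor
    dq = (q - zp) * scale                                     # dequantize_per_tensor
    return F.linear(dq, weight, bias)                         # linear
```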
Pull Request resolved: pytorch#89430 Approved by: https://github.com/ezyang
Fixes T137631262: caching conda dependencies for build workflows. Conda dependencies have been gathered from the workflow https://github.com/pytorch/pytorch/blob/master/.github/workflows/_buck-build-test.yml. The pull request updates the action from `conda-incubator/setup-miniconda@v2` to `pytorch/test-infra/.github/actions/setup-miniconda@main`, as it supports caching. Test Plan: run `ciflow/periodic`, which runs the CI `buck-build-test` workflow; the expected output is to have all the conda dependencies cached. [Screenshot 2022-11-22 at 15 44 20: all conda dependencies cached] Pull Request resolved: pytorch#89422 Approved by: https://github.com/huydhn
…orch#88089) This fixes some prod and masked.prod tests on Windows. np.prod uses int32 on Windows, so it overflows; on Linux it uses int64 by default. Fixes pytorch#77305 Fixes pytorch#77320 Fixes pytorch#77334 Fixes pytorch#77335 Fixes pytorch#77336 Fixes pytorch#77337 Pull Request resolved: pytorch#88089 Approved by: https://github.com/mruberry
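A quick demonstration of the overflow; the explicit dtype arguments below just force each platform's default accumulator:

```python
import numpy as np

a = np.array([2**20, 2**20])
print(np.prod(a, dtype=np.int32))  # 0 -- 2**40 wraps around in int32 (the Windows default)
print(np.prod(a, dtype=np.int64))  # 1099511627776 -- the Linux default
```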
Fixes #ISSUE_NUMBER Pull Request resolved: pytorch#89509 Approved by: https://github.com/ngimel, https://github.com/shunting314
Enables previously failing UCC distributed_test.py tests that are now fixed by either the ProcessGroupUCC barrier blocking fix (pytorch#86961) or the UCC-side timeout error handling fix (https://github.com/openucx/ucc/pull/679/files). Bumps the upstream UCC version to build UCC with the timeout error handling fix merged in. Pull Request resolved: pytorch#89023 Approved by: https://github.com/kwen2501, https://github.com/malfet
Fixes #ISSUE_NUMBER Pull Request resolved: pytorch#89493 Approved by: https://github.com/H-Huang
Test still fails when run on 5 A100 GPUs, although it works with 5 V100s. Using 4 GPUs seems to be fine. Followup to pytorch#85957 Pull Request resolved: pytorch#86280 Approved by: https://github.com/awgu, https://github.com/kit1980
The test may fail due to slightly different values caused by a different order of matrices in SGEMM:
> Mismatched elements: 1 / 50 (2.0%)
> Greatest absolute difference: 1.430511474609375e-05 at index (4, 5) (up to 1e-05 allowed)
> Greatest relative difference: 4.65393206065873e-06 at index (4, 5) (up to 1.3e-06 allowed)

Observed on POWER (ppc64le). Pull Request resolved: pytorch#86365 Approved by: https://github.com/mruberry, https://github.com/kit1980
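For illustration, a tiny reproduction of the comparison being loosened (values mirror the reported mismatch; the exact tolerances used by the test are an assumption here):

```python
import torch

expected = torch.tensor([1.0])
actual = torch.tensor([1.0 + 1.43e-5])  # ~1.43e-5 absolute difference, as reported
# Fails with atol=1e-5 / rtol=1.3e-6; passes once atol absorbs the SGEMM noise,
# since assert_close allows |actual - expected| <= atol + rtol * |expected|.
torch.testing.assert_close(actual, expected, rtol=1.3e-6, atol=2e-5)
```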
Replace the remaining hand-written code in vec256_float_vsx.h with calls to Sleef functions, similar to what was done in pytorch#59382 & pytorch#82646 after pytorch#41541. This fixes wrong results for e.g. `sin(1e20)`. Fixes pytorch#85978. To fix pytorch#85978 I only needed to change the sin/cos functions to make the test pass, but to avoid encountering the same issue again and again (see the previous PRs and issues) I checked the whole file for similar functions where a Sleef function could be used and changed those too. While reviewing the diff I noticed the faulty whitespace, so to make this complete I fixed that too; it should now be done. Pull Request resolved: pytorch#86453 Approved by: https://github.com/malfet
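To see why hand-written trig breaks at huge arguments, here is a quick Python illustration of naive range reduction (not the VSX code itself):

```python
import math

x = 1e20
# Naive reduction computes x mod 2*pi in double precision. At this magnitude
# the product 2*pi*round(x / (2*pi)) carries an absolute error in the
# thousands, so r bears no relation to the true reduced argument.
r = x - 2 * math.pi * round(x / (2 * math.pi))
print(math.sin(x))  # libm/Sleef-style careful reduction
print(math.sin(r))  # naive reduction gives a different, wrong answer
```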
Add commit date to build summary of dashboard. Make the date of the run reflective of when the run started, not when the run ended. Use PST (UTC -8) to determine day, rather than GMT (UTC +0). Test comment: pytorch/torchdynamo#1831 (comment) Pull Request resolved: pytorch#89517 Approved by: https://github.com/anijain2305
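For reference, pinning a timestamp to UTC-8 is a one-liner (a sketch, not the dashboard's actual code):

```python
from datetime import datetime, timedelta, timezone

pst = timezone(timedelta(hours=-8))    # fixed UTC-8, ignoring DST
run_day = datetime.now(tz=pst).date()  # day attributed to the run's start
```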
Fixes #ISSUE_NUMBER Pull Request resolved: pytorch#89455 Approved by: https://github.com/huydhn
We observed that the native PyTorch LayerNormBackwardKernelImplInternal has suboptimal performance on AMD GPUs for certain input sizes, especially when `fs` (=`config_m` in our benchmark script) is large and `bs` (=`config_n` in our benchmark script) is small (commonly seen in [the CvT model](https://arxiv.org/abs/2103.15808)), in the benchmark script of [PR pytorch#68238](pytorch#68238 (comment)). This PR replaces `GammaBetaBackwardCUDAKernel` with the Apex layernorm backward kernel, with some ROCm-specific parameter tuning, when `fs` (=`config_m`) is larger than 512 on AMD GPUs.

There are a few PRs for the LayerNorm kernel:

- pytorch#26201
- pytorch#27634
- pytorch#68238

Therefore, we have tested and compared the kernel before and at this PR with the input shapes in the last two PRs, along with those commonly used in the CvT model, on an AMD MI100.

---

**Current**

| M | N | fwd (half) | fwdbwd (half) | fwd (float) | fwdbwd (float) |
| -- | -- | -- | -- | -- | -- |
| 50432 | 384 | 0.387256 | 1.372758 | 0.378975 | 1.47892 |
| 50176 | 384 | 0.38231 | 1.362416 | 0.378084 | 1.473886 |
| 200704 | 192 | 0.997859 | 4.315875 | 0.989306 | 4.560827 |
| 802816 | 64 | 3.671828 | 16.68013 | 3.613515 | 16.827946 |
| 200 | 256 | 0.066503 | 0.332096 | 0.071422 | 0.325349 |
| 1000 | 256 | 0.071848 | 0.333355 | 0.073038 | 0.334753 |
| 6000 | 256 | 0.086334 | 0.345139 | 0.086834 | 0.347429 |
| 6272 | 256 | 0.088601 | 0.347906 | 0.087855 | 0.351245 |
| 200 | 512 | 0.071626 | 0.329726 | 0.073798 | 0.326878 |
| 1000 | 512 | 0.073975 | 0.330226 | 0.074166 | 0.332751 |
| 6000 | 512 | 0.099617 | 0.362367 | 0.100095 | 0.378313 |
| 6272 | 512 | 0.100378 | 0.358066 | 0.099857 | 0.395982 |
| 200 | 1024 | 0.072954 | 0.326382 | 0.073899 | 0.333007 |
| 1000 | 1024 | 0.0743 | 0.325532 | 0.071126 | 0.330991 |
| 6000 | 1024 | 0.127025 | 0.390084 | 0.128692 | 0.471504 |
| 6272 | 1024 | 0.130704 | 0.403536 | 0.135244 | 0.487133 |
| 200 | 1536 | 0.070331 | 0.339169 | 0.070086 | 0.331015 |
| 1000 | 1536 | 0.075085 | 0.330042 | 0.076295 | 0.328778 |
| 6000 | 1536 | 0.148889 | 0.44949 | 0.155781 | 0.659987 |
| 6272 | 1536 | 0.154939 | 0.478871 | 0.17673 | 0.716025 |
| 200 | 2048 | 0.070269 | 0.335585 | 0.072804 | 0.334655 |
| 1000 | 2048 | 0.080094 | 0.326991 | 0.080426 | 0.32685 |
| 6000 | 2048 | 0.187888 | 0.623023 | 0.245762 | 0.981635 |
| 6272 | 2048 | 0.195431 | 0.65244 | 0.262574 | 1.008141 |
| 200 | 3072 | 0.068205 | 0.339428 | 0.073068 | 0.344034 |
| 1000 | 3072 | 0.087554 | 0.328899 | 0.09218 | 0.346433 |
| 6000 | 3072 | 0.240352 | 0.905058 | 0.368135 | 1.280462 |
| 6272 | 3072 | 0.26179 | 0.959387 | 0.387782 | 1.476524 |
| 128 | 2097152 | 5.905976 | 22.724793 | 10.287974 | 30.242092 |
| 256 | 1048576 | 4.561596 | 19.554308 | 10.223171 | 29.42371 |
| 512 | 524288 | 4.146751 | 22.7247 | 11.404285 | 39.175902 |
| 1024 | 262144 | 5.193135 | 23.403325 | 11.334512 | 38.947192 |
| 2048 | 131072 | 4.992907 | 23.377801 | 11.400286 | 40.889191 |
| 4096 | 65536 | 5.429488 | 24.275701 | 11.196778 | 41.4751 |
| 8192 | 32768 | 5.35758 | 21.360312 | 10.535418 | 42.875646 |
| 16384 | 16384 | 5.44947 | 20.852605 | 10.357685 | 34.603408 |
| 32768 | 8192 | 4.688925 | 17.379392 | 9.635596 | 31.188271 |

---

**At this PR**

| M | N | fwd (half) | fwdbwd (half) | fwd (float) | fwdbwd (float) |
| -- | -- | -- | -- | -- | -- |
| 50432 | 384 | 0.38797 | 0.93103 | 0.37966 | 1.15283 |
| 50176 | 384 | 0.3874 | 0.96417 | 0.38462 | 1.18595 |
| 200704 | 192 | 1.00002 | 2.40876 | 0.99224 | 2.55579 |
| 802816 | 64 | 3.67348 | 7.98658 | 3.61871 | 7.72404 |
| 200 | 256 | 0.07292 | 0.35119 | 0.07195 | 0.32602 |
| 1000 | 256 | 0.07354 | 0.33325 | 0.07237 | 0.33742 |
| 6000 | 256 | 0.08819 | 0.33283 | 0.08453 | 0.3279 |
| 6272 | 256 | 0.0886 | 0.33446 | 0.08774 | 0.33426 |
| 200 | 512 | 0.0701 | 0.33505 | 0.07072 | 0.33018 |
| 1000 | 512 | 0.07042 | 0.33442 | 0.074 | 0.33206 |
| 6000 | 512 | 0.09931 | 0.34956 | 0.09895 | 0.3572 |
| 6272 | 512 | 0.10103 | 0.32976 | 0.10041 | 0.36635 |
| 200 | 1024 | 0.07144 | 0.33579 | 0.07209 | 0.33216 |
| 1000 | 1024 | 0.0736 | 0.32803 | 0.07286 | 0.32936 |
| 6000 | 1024 | 0.12584 | 0.38916 | 0.12852 | 0.48273 |
| 6272 | 1024 | 0.13053 | 0.38804 | 0.13464 | 0.49545 |
| 200 | 1536 | 0.07159 | 0.3396 | 0.07062 | 0.33545 |
| 1000 | 1536 | 0.07443 | 0.33239 | 0.07366 | 0.33204 |
| 6000 | 1536 | 0.14959 | 0.45043 | 0.15826 | 0.69119 |
| 6272 | 1536 | 0.1542 | 0.47644 | 0.18249 | 0.72208 |
| 200 | 2048 | 0.07258 | 0.33982 | 0.07412 | 0.33859 |
| 1000 | 2048 | 0.0793 | 0.32816 | 0.07864 | 0.32583 |
| 6000 | 2048 | 0.18973 | 0.571 | 0.25506 | 0.91796 |
| 6272 | 2048 | 0.19719 | 0.64208 | 0.26445 | 0.95055 |
| 200 | 3072 | 0.07092 | 0.33867 | 0.07104 | 0.34695 |
| 1000 | 3072 | 0.08727 | 0.33144 | 0.09144 | 0.36633 |
| 6000 | 3072 | 0.24683 | 0.87275 | 0.37761 | 1.3289 |
| 6272 | 3072 | 0.26437 | 0.91178 | 0.38496 | 1.53694 |
| 128 | 2097152 | 6.27936 | 23.69425 | 10.40004 | 30.13699 |
| 256 | 1048576 | 4.5404 | 19.47675 | 10.28494 | 29.36936 |
| 512 | 524288 | 4.13951 | 18.78771 | 10.09557 | 32.67083 |
| 1024 | 262144 | 4.47576 | 18.00411 | 9.56488 | 31.47117 |
| 2048 | 131072 | 4.28026 | 16.95619 | 9.40297 | 30.82845 |
| 4096 | 65536 | 4.2653 | 16.5018 | 9.03315 | 30.08392 |
| 8192 | 32768 | 4.25613 | 16.13583 | 8.9258 | 30.75296 |
| 16384 | 16384 | 4.20256 | 16.38207 | 9.52587 | 31.31113 |
| 32768 | 8192 | 4.20231 | 16.19452 | 9.31478 | 31.03514 |

---

**Performance Improvement (%)**

| M | N | fwdbwd, torch.float16 | fwdbwd, torch.float32 |
| -- | -- | -- | -- |
| 50432 | 384 | 32.178 | 22.049 |
| 50176 | 384 | 29.231 | 19.536 |
| 200704 | 192 | 44.188 | 43.962 |
| 802816 | 64 | 52.119 | 54.100 |
| 200 | 256 | -5.750 | -0.206 |
| 1000 | 256 | 0.031 | -0.797 |
| 6000 | 256 | 3.566 | 5.621 |
| 6272 | 256 | 3.865 | 4.836 |
| 200 | 512 | -1.615 | -1.010 |
| 1000 | 512 | -1.270 | 0.208 |
| 6000 | 512 | 3.534 | 5.581 |
| 6272 | 512 | 7.905 | 7.483 |
| 200 | 1024 | -2.883 | 0.254 |
| 1000 | 1024 | -0.767 | 0.493 |
| 6000 | 1024 | 0.237 | -2.381 |
| 6272 | 1024 | 3.840 | -1.707 |
| 200 | 1536 | -0.127 | -1.340 |
| 1000 | 1536 | -0.711 | -0.992 |
| 6000 | 1536 | -0.209 | -4.728 |
| 6272 | 1536 | 0.508 | -0.846 |
| 200 | 2048 | -1.262 | -1.176 |
| 1000 | 2048 | -0.358 | 0.312 |
| 6000 | 2048 | 8.350 | 6.487 |
| 6272 | 2048 | 1.588 | 5.713 |
| 200 | 3072 | 0.223 | -0.848 |
| 1000 | 3072 | -0.773 | -5.743 |
| 6000 | 3072 | 3.570 | -3.783 |
| 6272 | 3072 | 4.962 | -4.092 |
| 128 | 2097152 | -4.266 | 0.348 |
| 256 | 1048576 | 0.397 | 0.185 |
| 512 | 524288 | 17.325 | 16.605 |
| 1024 | 262144 | 23.070 | 19.195 |
| 2048 | 131072 | 27.469 | 24.605 |
| 4096 | 65536 | 32.023 | 27.465 |
| 8192 | 32768 | 24.459 | 28.274 |
| 16384 | 16384 | 21.439 | 9.514 |
| 32768 | 8192 | 6.818 | 0.491 |

---

**Benchmark script of this PR**

```python
# Ref:
# 1. pytorch#26201
# 2. pytorch#68238
import timeit

import torch
from torch.nn import LayerNorm

number_runs = 1000  # TODO: Modify this to save time!

def test_forward(layer_norm_cuda, input_cuda):
    layer_norm_cuda(input_cuda)
    torch.cuda.synchronize()

def test_backward(out_cuda, layer_norm_grad_cuda, create_graph):
    out_cuda.backward(layer_norm_grad_cuda, retain_graph=True, create_graph=create_graph)
    torch.cuda.synchronize()

def test_fwdbwd(input_cuda, layer_norm_cuda, gO):
    input_cuda.grad = None
    layer_norm_cuda.zero_grad(set_to_none=True)
    out = layer_norm_cuda(input_cuda)
    out.backward(gO)
    torch.cuda.synchronize()

def benchmark(config_m, config_n):
    print("M | N | fwd (half) | fwdbwd (half) | fwd (float) | fwdbwd (float)")
    if len(config_m) != len(config_n):
        print("Please make sure the lengths of config_m and config_n are the same.")
    for i in range(len(config_m)):
        normalized_shape = config_n[i]
        results = [config_m[i], config_n[i]]
        for dtype in (torch.half, torch.float):
            if dtype == torch.half:
                layer_norm_cuda = LayerNorm(normalized_shape).half().cuda()
            else:
                layer_norm_cuda = LayerNorm(normalized_shape).cuda()
            input_cuda = torch.randn(config_m[i], config_n[i], device='cuda', dtype=dtype, requires_grad=True)
            result_fwd = timeit.timeit(lambda: test_forward(layer_norm_cuda, input_cuda), number=number_runs)
            results.append(result_fwd / number_runs * 1000)
            gO = torch.rand_like(input_cuda)
            result_fwdbwd = timeit.timeit(lambda: test_fwdbwd(input_cuda, layer_norm_cuda, gO), number=number_runs)
            results.append(result_fwdbwd / number_runs * 1000)
        print('{:09d}|{:09d}|{:9.5f}|{:9.5f}|{:9.5f}|{:9.5f}'.format(results[0], results[1], results[2], results[3], results[4], results[5]))
    print("Times are in microseconds (us).")

# CvT
config_m_cvt = [50432, 50176, 200704, 802816]
config_n_cvt = [384, 384, 192, 64]
# pytorch#68238 (comment)
config_m_68238 = [200, 1000, 6000, 6272, 200, 1000, 6000, 6272, 200, 1000, 6000, 6272, 200, 1000, 6000, 6272, 200, 1000, 6000, 6272, 200, 1000, 6000, 6272]
config_n_68238 = [256, 256, 256, 256, 512, 512, 512, 512, 1024, 1024, 1024, 1024, 1536, 1536, 1536, 1536, 2048, 2048, 2048, 2048, 3072, 3072, 3072, 3072]
# pytorch#27634
config_m_27634 = [128, 256, 512, 1024, 2048, 4096, 8192, 16384, 32768]
config_n_27634 = [2097152, 1048576, 524288, 262144, 131072, 65536, 32768, 16384, 8192]

config_m = config_m_cvt + config_m_68238 + config_m_27634
config_n = config_n_cvt + config_n_68238 + config_n_27634
benchmark(config_m, config_n)
```

CC: @jeffdaily Pull Request resolved: pytorch#87635 Approved by: https://github.com/jataylo, https://github.com/jeffdaily, https://github.com/ezyang
Both quantized and non-quantized paths are supported.
Branch force-pushed from 5ae3a1d to aadf974 (compare)
This PR implements maxpool2d for NNC; it is another follow-up PR to enable quantization/channels-last support at the op level.
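For reference, a minimal eager-mode sketch of the numerics the lowering should match (illustrative only, not the NNC implementation):

```python
import torch
import torch.nn.functional as F

# Channels-last float path
x = torch.randn(1, 3, 8, 8).contiguous(memory_format=torch.channels_last)
y = F.max_pool2d(x, kernel_size=2, stride=2)

# Quantized path: the lowered op should match eager quantized numerics.
qx = torch.quantize_per_tensor(x, scale=0.1, zero_point=0, dtype=torch.quint8)
qy = F.max_pool2d(qx, kernel_size=2, stride=2)
# Max pooling commutes with the (monotone) dequantize, so these agree exactly.
torch.testing.assert_close(qy.dequantize(), F.max_pool2d(qx.dequantize(), 2, 2))
```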