
Conversation

Guobing-Chen
Owner

This PR implements maxpool2d for NNC and is another follow-up PR to enable quantization/channels-last support at the op level.

  • maxpool2d NNC lowering function implementation and lowering-path enablement:
torch/csrc/jit/tensorexpr/operators/reduction.cpp
torch/csrc/jit/tensorexpr/operators/reduction.h
torch/csrc/jit/tensorexpr/lowerings.cpp
  • maxpool2d NNC external call implementations for both the default and out versions:
torch/csrc/jit/tensorexpr/external_functions.cpp
torch/csrc/jit/tensorexpr/codegen.cpp
  • Added test cases covering quantization and non-quantization scenarios:
test/cpp/tensorexpr/test_quantization.cpp
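
As a rough illustration of how the new lowering can be exercised from Python (a minimal sketch; the real coverage lives in test_quantization.cpp, and the fuser toggles below are internal APIs):

```python
import torch

# Enable the NNC (TensorExpr) fuser on CPU so max_pool2d can take the new lowering.
torch._C._jit_set_texpr_fuser_enabled(True)
torch._C._jit_override_can_fuse_on_cpu(True)

@torch.jit.script
def f(x):
    return torch.max_pool2d(torch.relu(x), kernel_size=[2, 2])

# Channels-last input, matching the channels-last scenario this PR targets.
x = torch.randn(1, 3, 16, 16).to(memory_format=torch.channels_last)
for _ in range(3):  # warm up so the profiling executor specializes and fuses
    f(x)
# A TensorExprGroup node in the optimized graph indicates NNC picked it up.
print(torch.jit.last_executed_optimized_graph())
```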

@Guobing-Chen force-pushed the nnc_quant_op_maxpool2d branch 2 times, most recently from 29c55d7 to e917fd4 on August 26, 2022 02:36
ezyang and others added 23 commits November 16, 2022 01:08
This reduces boilerplate.  Also, I plan to add a template
parameter to ConvParams; without moving the methods onto the
struct, I would have to manually template every method.

Signed-off-by: Edward Z. Yang <[email protected]>

Pull Request resolved: pytorch#89062
Approved by: https://github.com/SherlockNoMad
…tor (pytorch#88859)"

This reverts commit d60abe4.

Reverted pytorch#88859 on behalf of https://github.com/kit1980 due to breaking macOS testing, as clearly shown in CI
Now that periodic jobs run under `mem_leak_check` mode with parallelization turned off, it's very easy for `linux-bionic-cuda11.6-py3-gcc7-slow-gradcheck / test` to time out because one of the shards is very close to the 4-hour mark.

* https://hud.pytorch.org/pytorch/pytorch/commit/2452e3f99a072760fc46d3f9025aaa37ca7ea2ab
* https://hud.pytorch.org/pytorch/pytorch/commit/35e668b5ced25e735b6e523d557ed7fd60267914

Pull Request resolved: pytorch#89079
Approved by: https://github.com/clee2000
…ytorch#89066)

This adds a unit test following the FSDP change in pytorch#88781.
Pull Request resolved: pytorch#89066
Approved by: https://github.com/fegin
… call (pytorch#89029)

# Summary
Creates a callable native function that can determine which implementation of scaled dot product attention will get called. This makes it possible to reorder the runtime dispatch of SDP to enable autograd.
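
For context, a minimal sketch of how the backend choice surfaces at the Python level (the `sdp_kernel` context manager and `scaled_dot_product_attention` names are assumptions based on the API that later became public, not this PR's C++ entry point):

```python
import torch
import torch.nn.functional as F

q, k, v = (torch.randn(2, 8, 128, 64, device="cuda", dtype=torch.float16)
           for _ in range(3))

# Restrict the dispatcher to one backend to observe which implementation runs.
with torch.backends.cuda.sdp_kernel(enable_flash=True,
                                    enable_math=False,
                                    enable_mem_efficient=False):
    out = F.scaled_dot_product_attention(q, k, v)
```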
Pull Request resolved: pytorch#89029
Approved by: https://github.com/cpuhrsch
…85624)

When building products using PyTorch, it is often required to display license terms for all dependencies.
The feature itself was implemented in pytorch#81500, but there was no option to enable it.
This PR implements that option.

cc/ @mattip @rgommers
Pull Request resolved: pytorch#85624
Approved by: https://github.com/rgommers, https://github.com/seemethere
Summary: Support shape padding for aten.mm in Inductor (originally from [pytorch#88709](pytorch#88709))
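
The idea behind shape padding, as a hedged sketch (the `pad_dim`/`padded_mm` helpers below are hypothetical, not Inductor's actual pass): pad the operands of `aten.mm` up to hardware-friendly multiples, run the matmul, then slice the result back.

```python
import torch
import torch.nn.functional as F

def pad_dim(t, dim, multiple=8):
    """Zero-pad `t` along `dim` up to the next multiple (hypothetical helper)."""
    extra = -t.size(dim) % multiple
    if extra == 0:
        return t
    pad = [0, 0] * t.dim()
    # F.pad's pad list runs from the last dimension backwards.
    pad[2 * (t.dim() - 1 - dim) + 1] = extra
    return F.pad(t, pad)

def padded_mm(a, b, multiple=8):
    m, k = a.shape
    n = b.shape[1]
    a = pad_dim(pad_dim(a, 0, multiple), 1, multiple)
    b = pad_dim(pad_dim(b, 0, multiple), 1, multiple)
    # Zero padding contributes nothing to the valid region, so slicing is exact.
    return torch.mm(a, b)[:m, :n]

out = padded_mm(torch.randn(123, 70), torch.randn(70, 45))
assert out.shape == (123, 45)
```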

Differential Revision: D41315078

Pull Request resolved: pytorch#89086
Approved by: https://github.com/jianyuh
Inductor test report artifacts are now on HUD, but the files are in CSV format instead of the XML files from pytest or unittest that we expect by default, so this PR uploads both suffixes.

Pull Request resolved: pytorch#89112
Approved by: https://github.com/desertfire
…ytorch#88549)

This PR creates the `torch.distributed._tensor` package and moves
DeviceMesh and the placement types to it.

part of pytorch#88838
Pull Request resolved: pytorch#88549
Approved by: https://github.com/fduwjj
…ed (pytorch#88176)

This PR moves the core DTensor abstraction and high level APIs to
torch.distributed._tensor folder, which includes the following:
1. DTensor class
2. high level APIs (distribute_tensor/module)
3. dispatching logic
4. redistribute logic
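
A minimal usage sketch of the APIs listed above (assuming the import paths as of this move; typically launched with torchrun so every rank executes the script):

```python
import torch
import torch.distributed as dist
from torch.distributed._tensor import DeviceMesh, Shard, distribute_tensor

dist.init_process_group("gloo")
mesh = DeviceMesh("cpu", list(range(dist.get_world_size())))

big = torch.randn(8, 8)
# Shard dim 0 across the mesh; each rank materializes only its local slice.
dtensor = distribute_tensor(big, mesh, placements=[Shard(0)])
print(dist.get_rank(), dtensor.to_local().shape)
```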

part of pytorch#88838
Pull Request resolved: pytorch#88176
Approved by: https://github.com/fduwjj
…88177)

This PR moves most DTensor ops to torch.distributed._tensor. We will
add all tests in the following PRs.

part of pytorch#88838
Pull Request resolved: pytorch#88177
Approved by: https://github.com/fduwjj
…orch#88550)

This PR moves the view-related DTensor ops to core distributed;
tests will be added in follow-up PRs.

part of pytorch#88838
Pull Request resolved: pytorch#88550
Approved by: https://github.com/fduwjj
…ch#88178)

This PR moves the basic DTensor tests to torch.distributed, including
the dtensor and device_mesh tests.

part of pytorch#88838
Pull Request resolved: pytorch#88178
Approved by: https://github.com/fduwjj
…88551)

This PR moves DTensor op tests to core distributed, including
prop_rule, pointwise op, matrix op tests, etc.

part of pytorch#88838
Pull Request resolved: pytorch#88551
Approved by: https://github.com/aazzolini
…ytorch#88179)

This PR moves remaining tests, i.e. tensor_ops, op db tests to core distributed

part of pytorch#88838
Pull Request resolved: pytorch#88179
Approved by: https://github.com/aazzolini
…ted (pytorch#88180)

This PR moves tensor/parallel folder and tests to torch.distributed.

part of pytorch#88838
Pull Request resolved: pytorch#88180
Approved by: https://github.com/aazzolini
…orch#89118)

Summary:
Build fx-based linear/matmul/bmm + permute/transpose vertical fusion in Inductor

For an internal Ads model: **1.15x -> 1.36x speedup**

Test Plan: CI

Reviewed By: bertmaher, jansel, jianyuh

Differential Revision: D41071665

Pull Request resolved: pytorch#89118
Approved by: https://github.com/jianyuh
huydhn and others added 29 commits November 22, 2022 03:39
…ed-tests mode (pytorch#89454)

When looking into the Rockset data for a disabled unittest, for example `testAdd`, I see that it's re-run only 3 times instead of the 50+ times expected under rerun-disabled-tests mode

```
[
  {
    "name": "testAdd",
    "classname": "TestLazyReuseIr",
    "filename": "lazy/test_reuse_ir.py",
    "flaky": false,
    "num_green": 3,
    "num_red": 0
  }
]
```

It turns out that I made a mistake mixing `RERUN_DISABLED_TESTS` and `report_only` into `(RERUN_DISABLED_TESTS or report_only) and num_retries_left < MAX_NUM_RETRIES` in pytorch#88646.  The retry logic for successful tests under rerun-disabled-tests mode was never executed, because num_retries_left would be equal to MAX_NUM_RETRIES (not smaller) if the very first run succeeds. Thus, the sample test `testAdd` finishes right away (1 success count)

* `report_only` and `RERUN_DISABLED_TESTS` are two different things and shouldn't be mixed together; `RERUN_DISABLED_TESTS` has the higher priority (see the sketch below).
* We also don't want to retry skipped tests under rerun-disabled-tests mode, because they are only skipped due to the `check_if_enable` check: `Test is enabled but --rerun-disabled-tests verification mode is set, so only disabled tests are run`
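
A minimal sketch of the separated decision logic (illustrative only, not the actual `torch.testing._internal` code):

```python
MAX_NUM_RETRIES = 50

def should_retry(success: bool, rerun_disabled_tests: bool,
                 report_only: bool, num_retries_left: int) -> bool:
    if rerun_disabled_tests:
        # Rerun-disabled-tests mode has priority: keep rerunning until the
        # budget is spent, even when the very first attempt succeeds.
        return num_retries_left > 0
    if report_only:
        # Report-only mode retries failures within the budget.
        return (not success) and num_retries_left > 0
    return not success
```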

### Testing

* CI https://github.com/pytorch/pytorch/actions/runs/3518228784 generates https://gha-artifacts.s3.amazonaws.com/pytorch/pytorch/3518228784/1/artifact/test-reports-test-default-4-4-linux.4xlarge.nvidia.gpu_9627285587.zip in which `testAdd` is correctly called multiple times and `TestLazyReuseIr` is skipped correctly
* Locally

```
# export CI=1
# export PYTORCH_RETRY_TEST_CASES=1
# export PYTORCH_OVERRIDE_FLAKY_SIGNAL=1
# export PYTORCH_TEST_RERUN_DISABLED_TESTS=1
$ python test/run_test.py --verbose -i lazy/test_reuse_ir
Ignoring disabled issues:  []
Selected tests:
 lazy/test_reuse_ir
Prioritized test from test file changes.
reordering tests for PR:
prioritized: []
the rest: ['lazy/test_reuse_ir']

Downloading https://raw.githubusercontent.com/pytorch/test-infra/generated-stats/stats/slow-tests.json to /Users/huydo/Storage/mine/pytorch/test/.pytorch-slow-tests.json
Downloading https://raw.githubusercontent.com/pytorch/test-infra/generated-stats/stats/disabled-tests-condensed.json to /Users/huydo/Storage/mine/pytorch/test/.pytorch-disabled-tests.json
parallel (file granularity) tests:
 lazy/test_reuse_ir
serial (file granularity) tests:

Ignoring disabled issues:  []
Ignoring disabled issues:  []
Running lazy/test_reuse_ir ... [2022-11-21 13:21:07.165877]
Executing ['/Users/huydo/miniconda3/envs/py3.9/bin/python', '-bb', 'lazy/test_reuse_ir.py', '-v', '--import-slow-tests', '--import-disabled-tests', '--rerun-disabled-tests'] ... [2022-11-21 13:21:07.166279]

Expand the folded group to see the log file of lazy/test_reuse_ir
##[group]PRINTING LOG FILE of lazy/test_reuse_ir (/Users/huydo/Storage/mine/pytorch/test/test-reports/lazy-test_reuse_ir_6cf_dxa1)

Running tests...
----------------------------------------------------------------------
Test results will be stored in test-reports/python-unittest/lazy.test_reuse_ir
  testAdd (__main__.TestLazyReuseIr) ... ok (1.215s)
  testAdd (__main__.TestLazyReuseIr) ...     testAdd succeeded - num_retries_left: 50
ok (0.001s)
  testAdd (__main__.TestLazyReuseIr) ...     testAdd succeeded - num_retries_left: 49
ok (0.001s)
  testAdd (__main__.TestLazyReuseIr) ...     testAdd succeeded - num_retries_left: 48
ok (0.001s)
  testAdd (__main__.TestLazyReuseIr) ...     testAdd succeeded - num_retries_left: 47
ok (0.001s)
  testAdd (__main__.TestLazyReuseIr) ...     testAdd succeeded - num_retries_left: 46
ok (0.001s)
  testAdd (__main__.TestLazyReuseIr) ...     testAdd succeeded - num_retries_left: 45
ok (0.001s)
  testAdd (__main__.TestLazyReuseIr) ...     testAdd succeeded - num_retries_left: 44
ok (0.001s)
  testAdd (__main__.TestLazyReuseIr) ...     testAdd succeeded - num_retries_left: 43
ok (0.001s)
  testAdd (__main__.TestLazyReuseIr) ...     testAdd succeeded - num_retries_left: 42
ok (0.001s)
  testAdd (__main__.TestLazyReuseIr) ...     testAdd succeeded - num_retries_left: 41
ok (0.001s)
  testAdd (__main__.TestLazyReuseIr) ...     testAdd succeeded - num_retries_left: 40
ok (0.001s)
  testAdd (__main__.TestLazyReuseIr) ...     testAdd succeeded - num_retries_left: 39
ok (0.001s)
  testAdd (__main__.TestLazyReuseIr) ...     testAdd succeeded - num_retries_left: 38
ok (0.001s)
  testAdd (__main__.TestLazyReuseIr) ...     testAdd succeeded - num_retries_left: 37
ok (0.001s)
  testAdd (__main__.TestLazyReuseIr) ...     testAdd succeeded - num_retries_left: 36
ok (0.001s)
  testAdd (__main__.TestLazyReuseIr) ...     testAdd succeeded - num_retries_left: 35
ok (0.001s)
  testAdd (__main__.TestLazyReuseIr) ...     testAdd succeeded - num_retries_left: 34
ok (0.001s)
  testAdd (__main__.TestLazyReuseIr) ...     testAdd succeeded - num_retries_left: 33
ok (0.001s)
  testAdd (__main__.TestLazyReuseIr) ...     testAdd succeeded - num_retries_left: 32
ok (0.001s)
  testAdd (__main__.TestLazyReuseIr) ...     testAdd succeeded - num_retries_left: 31
ok (0.001s)
  testAdd (__main__.TestLazyReuseIr) ...     testAdd succeeded - num_retries_left: 30
ok (0.001s)
  testAdd (__main__.TestLazyReuseIr) ...     testAdd succeeded - num_retries_left: 29
ok (0.001s)
  testAdd (__main__.TestLazyReuseIr) ...     testAdd succeeded - num_retries_left: 28
ok (0.001s)
  testAdd (__main__.TestLazyReuseIr) ...     testAdd succeeded - num_retries_left: 27
ok (0.001s)
  testAdd (__main__.TestLazyReuseIr) ...     testAdd succeeded - num_retries_left: 26
ok (0.001s)
  testAdd (__main__.TestLazyReuseIr) ...     testAdd succeeded - num_retries_left: 25
ok (0.001s)
  testAdd (__main__.TestLazyReuseIr) ...     testAdd succeeded - num_retries_left: 24
ok (0.001s)
  testAdd (__main__.TestLazyReuseIr) ...     testAdd succeeded - num_retries_left: 23
ok (0.001s)
  testAdd (__main__.TestLazyReuseIr) ...     testAdd succeeded - num_retries_left: 22
ok (0.001s)
  testAdd (__main__.TestLazyReuseIr) ...     testAdd succeeded - num_retries_left: 21
ok (0.001s)
  testAdd (__main__.TestLazyReuseIr) ...     testAdd succeeded - num_retries_left: 20
ok (0.001s)
  testAdd (__main__.TestLazyReuseIr) ...     testAdd succeeded - num_retries_left: 19
ok (0.001s)
  testAdd (__main__.TestLazyReuseIr) ...     testAdd succeeded - num_retries_left: 18
ok (0.001s)
  testAdd (__main__.TestLazyReuseIr) ...     testAdd succeeded - num_retries_left: 17
ok (0.001s)
  testAdd (__main__.TestLazyReuseIr) ...     testAdd succeeded - num_retries_left: 16
ok (0.001s)
  testAdd (__main__.TestLazyReuseIr) ...     testAdd succeeded - num_retries_left: 15
ok (0.001s)
  testAdd (__main__.TestLazyReuseIr) ...     testAdd succeeded - num_retries_left: 14
ok (0.001s)
  testAdd (__main__.TestLazyReuseIr) ...     testAdd succeeded - num_retries_left: 13
ok (0.001s)
  testAdd (__main__.TestLazyReuseIr) ...     testAdd succeeded - num_retries_left: 12
ok (0.001s)
  testAdd (__main__.TestLazyReuseIr) ...     testAdd succeeded - num_retries_left: 11
ok (0.001s)
  testAdd (__main__.TestLazyReuseIr) ...     testAdd succeeded - num_retries_left: 10
ok (0.001s)
  testAdd (__main__.TestLazyReuseIr) ...     testAdd succeeded - num_retries_left: 9
ok (0.001s)
  testAdd (__main__.TestLazyReuseIr) ...     testAdd succeeded - num_retries_left: 8
ok (0.001s)
  testAdd (__main__.TestLazyReuseIr) ...     testAdd succeeded - num_retries_left: 7
ok (0.001s)
  testAdd (__main__.TestLazyReuseIr) ...     testAdd succeeded - num_retries_left: 6
ok (0.001s)
  testAdd (__main__.TestLazyReuseIr) ...     testAdd succeeded - num_retries_left: 5
ok (0.001s)
  testAdd (__main__.TestLazyReuseIr) ...     testAdd succeeded - num_retries_left: 4
ok (0.001s)
  testAdd (__main__.TestLazyReuseIr) ...     testAdd succeeded - num_retries_left: 3
ok (0.001s)
  testAdd (__main__.TestLazyReuseIr) ...     testAdd succeeded - num_retries_left: 2
ok (0.001s)
  testAdd (__main__.TestLazyReuseIr) ...     testAdd succeeded - num_retries_left: 1
ok (0.001s)
  testAddSub (__main__.TestLazyReuseIr) ...     testAdd succeeded - num_retries_left: 0
skip: Test is enabled but --rerun-disabled-tests verification mode is set, so only disabled tests are run (0.001s)
  testAddSubFallback (__main__.TestLazyReuseIr) ... skip: Test is enabled but --rerun-disabled-tests verification mode is set, so only disabled tests are run (0.001s)
  testBatchNorm (__main__.TestLazyReuseIr) ... skip: Test is enabled but --rerun-disabled-tests verification mode is set, so only disabled tests are run (0.001s)

----------------------------------------------------------------------
Ran 54 tests in 1.264s

OK (skipped=3)
```

Here is the sample Rockset query

```
WITH added_row_number AS (
  SELECT
    *,
    ROW_NUMBER() OVER(PARTITION BY name, classname, filename ORDER BY _event_time DESC) AS row_number
  FROM
    commons.rerun_disabled_tests
)
SELECT
  name,
  classname,
  filename,
  flaky,
  num_green,
  num_red
FROM
  added_row_number
WHERE
  row_number = 1
  AND name = 'testAdd'
```
Pull Request resolved: pytorch#89454
Approved by: https://github.com/clee2000
… core distributed (pytorch#89399)

This PR moves dedup_tensors and its test to torch.distributed.checkpoint. This is a pre-req for enabling 2D checkpoint.

This removes duplicated shards in the list of SavePlans. It is used when saving a DTensor with replicated placement.

Docstring and comments will be added in the following PRs.
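
A rough sketch of the deduplication idea (hypothetical data shapes; the real code operates on `SavePlan`/`WriteItem` objects):

```python
from typing import List, Tuple

def dedup_plans(plans: List[List[Tuple[str, object]]]) -> List[List[Tuple[str, object]]]:
    # One plan per rank; with replicated placements the same shard key appears
    # in several ranks' plans, and only the first occurrence should be written.
    seen = set()
    deduped = []
    for plan in plans:
        kept = []
        for key, item in plan:
            if key not in seen:
                seen.add(key)
                kept.append((key, item))
        deduped.append(kept)
    return deduped
```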
Pull Request resolved: pytorch#89399
Approved by: https://github.com/wanchaol
…88904)

In pytorch#87741 we added inference support for the dynamo/torchxla integration. Later, in pytorch#88449, we attempted to add training support. That attempt did not go smoothly because
- we tried 2 things together:
   1. let dynamo trace the model on XLA rather than eager
   2. enable training
- it turns out neither of these two tasks is trivial.

Furthermore, item 2 (enabling training) depends on item 1 (tracing on XLA). We enable training via AOTAutograd. AOTAutograd lifts all model parameters/buffers as graph inputs. Without item 1 being done, we would need to copy all graph inputs (including model parameters/buffers) from the eager device to XLA devices, which hurts performance a lot. Having a cache that maps eager parameters to XLA parameters does not solve the problem, since an update to either side does not sync automatically to the other; they easily go out of sync.

This PR let dynamo trace the model on XLA rather than eager. This is a preparation step to enabling training.

Also, tracing on XLA makes the data movement more efficient. We see a 1.5x geomean speedup compared to the previous 1.38x.
```
+-------------------------+--------------------+-------------------------+
| Model                   |   XLA (trace once) |   XLA (trace everytime) |
+=========================+====================+=========================+
| resnet18                |            1.38    |                 1.008   |
+-------------------------+--------------------+-------------------------+
| resnet50                |            1.227   |                 0.998   |
+-------------------------+--------------------+-------------------------+
| resnext50_32x4d         |            1.544   |                 1.008   |
+-------------------------+--------------------+-------------------------+
| alexnet                 |            1.085   |                 1.045   |
+-------------------------+--------------------+-------------------------+
| mobilenet_v2            |            2.028   |                 1.013   |
+-------------------------+--------------------+-------------------------+
| mnasnet1_0              |            1.516   |                 0.995   |
+-------------------------+--------------------+-------------------------+
| squeezenet1_1           |            0.868   |                 1.01    |
+-------------------------+--------------------+-------------------------+
| vgg16                   |            1.099   |                 1.008   |
+-------------------------+--------------------+-------------------------+
| BERT_pytorch            |            3.26    |                 1.027   |
+-------------------------+--------------------+-------------------------+
| timm_vision_transformer |            2.182   |                 1.015   |
+-------------------------+--------------------+-------------------------+
| geomean                 |            1.50389 |                 1.01261 |
+-------------------------+--------------------+-------------------------+
```

Example command
```
GPU_NUM_DEVICES=1 python benchmarks/dynamo/torchbench.py --randomize-input --performance --trace-on-xla --only resnet18 --backend=torchxla_trace_once
```

Pull Request resolved: pytorch#88904
Approved by: https://github.com/wconstab, https://github.com/JackCaoG, https://github.com/jansel
…rch#89463)

Summary: This permute copy change seems to be causing huge regressions on machines without AVX512. Revert to mitigate. This shouldn't be problematic since the improvement from the change was very small anyway.

Differential Revision: D41450088

Pull Request resolved: pytorch#89463
Approved by: https://github.com/hlu1
… distributed (pytorch#89398)

This PR moves traverse and its test to torch.distributed.checkpoint. This is a pre-req for enabling 2D checkpoint.

This is used when flattening nested dicts and flattening sharded tensors.

Docstring and comments will be added in the following PRs.
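
A minimal sketch of the traversal idea (illustrative; the real helper in torch.distributed.checkpoint also understands sharded tensors):

```python
from typing import Any, Callable, Tuple

def traverse_state_dict(obj: Any, visitor: Callable[[Tuple, Any], None],
                        path: Tuple = ()) -> None:
    # Recurse through nested dicts/lists and call `visitor` on each leaf
    # with its flattened path, e.g. ("model", "layer1", "weight").
    if isinstance(obj, dict):
        for k, v in obj.items():
            traverse_state_dict(v, visitor, path + (k,))
    elif isinstance(obj, (list, tuple)):
        for i, v in enumerate(obj):
            traverse_state_dict(v, visitor, path + (i,))
    else:
        visitor(path, obj)

traverse_state_dict({"model": {"w": 1, "b": [2, 3]}}, lambda p, v: print(p, v))
```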

Test:
```
python3 test/distributed/_tensor/parallel/test_2d_parallel.py
```
and CI
Pull Request resolved: pytorch#89398
Approved by: https://github.com/wanchaol
Summary: Fix rounding issue in quantized shaders

Test Plan:
On Mac
```
cd ~/fbsource
buck1 run -c pt.vulkan_full_precision=1 //xplat/caffe2:pt_vulkan_quantized_api_test_binAppleMac\#macosx-arm64
```

On Android
```
cd ~/fbsource
buck1 build -c ndk.custom_libcxx=false -c pt.enable_qpl=0 -c pt.vulkan_full_precision=1 //xplat/caffe2:pt_vulkan_quantized_api_test_binAndroid\#android-arm64 --show-output
adb push buck-out/gen/xplat/caffe2/pt_vulkan_quantized_api_test_binAndroid\#android-arm64 /data/local/tmp/vulkan_quantized_api_test
adb shell "/data/local/tmp/vulkan_quantized_api_test"
```

Reviewed By: salilsdesai

Differential Revision: D41047095

Pull Request resolved: pytorch#89456
Approved by: https://github.com/kirklandsign, https://github.com/digantdesai
…c quantization (pytorch#89248)

Summary:
split the is_decomposed logic for `_replace_observer_with_quantize_dequantize_node` into a separate function and added support for dynamic quantization in the decomposed version of this function.

In case of dynamic quantization, we'll produce the following reference quantized pattern in decomposed mode:
```
x -> choose_qparams -> quantize_per_tensor -> dequantize_per_tensor -> linear
```
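
A minimal repro sketch of the flow, assuming the FX quantization entry points of that time (`prepare_fx`, `_convert_to_reference_decomposed_fx`); treat the exact imports as approximate:

```python
import torch
from torch.ao.quantization import QConfigMapping, default_dynamic_qconfig
from torch.ao.quantization.quantize_fx import prepare_fx, _convert_to_reference_decomposed_fx

class M(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.linear = torch.nn.Linear(4, 4)

    def forward(self, x):
        return self.linear(x)

example_inputs = (torch.randn(1, 4),)
qconfig_mapping = QConfigMapping().set_object_type(torch.nn.Linear, default_dynamic_qconfig)
m = prepare_fx(M().eval(), qconfig_mapping, example_inputs)
m = _convert_to_reference_decomposed_fx(m)
# The printed graph should contain choose_qparams / quantize_per_tensor /
# dequantize_per_tensor feeding the linear, as described above.
print(m.graph)
```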

Test Plan:
python test/test_quantization.py -k test__convert_to_reference_decomposed_fx_dynamic_quant

Reviewers:

Subscribers:

Tasks:

Tags:

Pull Request resolved: pytorch#89248
Approved by: https://github.com/vkuzo
Fixes - T137631262

Caching conda dependencies for build workflows.
Conda dependencies have been gathered from the workflow https://github.com/pytorch/pytorch/blob/master/.github/workflows/_buck-build-test.yml

The pull request updates the action from `conda-incubator/setup-miniconda@v2` to `pytorch/test-infra/.github/actions/setup-miniconda@main` as it supports caching.

Test Plan:

Running `ciflow/periodic`, which runs the `buck-build-test` CI workflow. The expected output is to have all the conda dependencies cached.

<img width="1227" alt="Screenshot 2022-11-22 at 15 44 20" src="https://user-images.githubusercontent.com/15447437/203343298-e55c384b-01ad-45c3-a5e9-ba5c53149be4.png">

Pull Request resolved: pytorch#89422
Approved by: https://github.com/huydhn
…orch#88089)

This fixes some prod and masked.prod tests on Windows.

np.prod uses int32 on Windows, so it overflows.

On Linux it uses int64 by default.
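
A quick illustration of the overflow, forcing int32 to mimic the Windows default:

```python
import numpy as np

a = np.array([100000, 100000], dtype=np.int32)
print(np.prod(a))                  # 1410065408 — wrapped around int32
print(np.prod(a, dtype=np.int64))  # 10000000000 — the correct product
```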

Fixes pytorch#77305
Fixes pytorch#77320
Fixes pytorch#77334
Fixes pytorch#77335
Fixes pytorch#77336
Fixes pytorch#77337

Pull Request resolved: pytorch#88089
Approved by: https://github.com/mruberry
Enables previously failing UCC distributed_test.py tests that are now fixed by either the ProcessGroupUCC barrier blocking fix (pytorch#86961) or the UCC-side timeout error handling fix (https://github.com/openucx/ucc/pull/679/files). Bumps the upstream UCC version to build UCC with the timeout error handling fix merged in.

Pull Request resolved: pytorch#89023
Approved by: https://github.com/kwen2501, https://github.com/malfet
The test still fails when run on 5 A100 GPUs, although it works with 5 V100s. Using 4 GPUs seems to be fine.

Followup to pytorch#85957

Pull Request resolved: pytorch#86280
Approved by: https://github.com/awgu, https://github.com/kit1980
The test may fail due to slightly different values caused by a different order of matrices in SGEMM:

> Mismatched elements: 1 / 50 (2.0%)
> Greatest absolute difference: 1.430511474609375e-05 at index (4, 5) (up to 1e-05 allowed)
> Greatest relative difference: 4.65393206065873e-06 at index (4, 5) (up to 1.3e-06 allowed)

Observed on POWER (ppc64le)
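
A typical remedy in such cases is to widen the per-test tolerances, e.g. (illustrative values only, not necessarily the ones this PR chose):

```python
import torch

actual = torch.tensor([1.0000143])   # value produced with a different SGEMM order
expected = torch.tensor([1.0])
# Default float32 tolerances (rtol=1.3e-6, atol=1e-5) would reject this;
# a slightly larger atol absorbs the reordering noise.
torch.testing.assert_close(actual, expected, rtol=1.3e-6, atol=2e-5)
```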

Pull Request resolved: pytorch#86365
Approved by: https://github.com/mruberry, https://github.com/kit1980
Replace the remaining hand-written code in vec256_float_vsx.h with calls to Sleef functions, similar to what was done in pytorch#59382 & pytorch#82646 after pytorch#41541.

This fixes wrong results for e.g. `sin(1e20)`.
Fixes pytorch#85978

To fix pytorch#85978 I only needed to change the sin/cos functions to make the test pass, but to avoid encountering the same issue again and again (see the previous PRs and issues) I checked the whole file for similar functions where a Sleef function could be used and changed those too. While doing that I noticed the faulty whitespace in the diff, so I fixed that as well; the file should now be complete.
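
A quick way to check the failure mode (on an affected ppc64le build the two values diverge; after the fix they agree):

```python
import math
import torch

x = torch.full((8,), 1e20)        # large enough that naive range reduction fails
print(torch.sin(x)[0].item())     # vectorized (VSX) path
print(math.sin(1e20))             # scalar libm reference
```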

Pull Request resolved: pytorch#86453
Approved by: https://github.com/malfet
Add the commit date to the build summary of the dashboard. Make the date of the run reflect when the run started, not when it ended. Use PST (UTC-8) to determine the day, rather than GMT (UTC+0).
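
For the timezone part, the day bucket can be computed along these lines (an illustrative sketch, not the dashboard's actual code):

```python
from datetime import datetime, timedelta, timezone

PST = timezone(timedelta(hours=-8))           # fixed UTC-8
run_started_at = datetime.now(timezone.utc)   # capture at run start, not end
dashboard_day = run_started_at.astimezone(PST).date()
print(dashboard_day)
```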

Test comment: pytorch/torchdynamo#1831 (comment)

Pull Request resolved: pytorch#89517
Approved by: https://github.com/anijain2305
We observed that the native PyTorch LayerNormBackwardKernelImplInternal has suboptimal performance for certain input sizes on AMD GPUs, especially when `fs` (=`config_m` in our benchmark script) is large and `bs` (=`config_n` in our benchmark script) is small, a pattern commonly seen in [the CvT model](https://arxiv.org/abs/2103.15808). We used the benchmark script of [PR pytorch#68238](pytorch#68238 (comment)) on AMD GPUs.

This PR replaces `GammaBetaBackwardCUDAKernel` with the Apex layernorm backward kernel, with some ROCm-specific parameter tuning, when `fs` (=`config_m`) is larger than 512 on AMD GPUs.

There are a few PRs for LayerNorm kernel:
- pytorch#26201
- pytorch#27634
- pytorch#68238

Therefore, we have tested and compared the kernel before and at this PR with the input shapes in the last two PRs along with those commonly used in the CvT model on AMD MI100.

---
**Current**
M | N | fwd (half) | fwdbwd (half) | fwd (float) | fwdbwd (float)
-- | -- | -- | -- | -- | --
50432 | 384 | 0.387256 | 1.372758 | 0.378975 | 1.47892
50176 | 384 | 0.38231 | 1.362416 | 0.378084 | 1.473886
200704 | 192 | 0.997859 | 4.315875 | 0.989306 | 4.560827
802816 | 64 | 3.671828 | 16.68013 | 3.613515 | 16.827946
200 | 256 | 0.066503 | 0.332096 | 0.071422 | 0.325349
1000 | 256 | 0.071848 | 0.333355 | 0.073038 | 0.334753
6000 | 256 | 0.086334 | 0.345139 | 0.086834 | 0.347429
6272 | 256 | 0.088601 | 0.347906 | 0.087855 | 0.351245
200 | 512 | 0.071626 | 0.329726 | 0.073798 | 0.326878
1000 | 512 | 0.073975 | 0.330226 | 0.074166 | 0.332751
6000 | 512 | 0.099617 | 0.362367 | 0.100095 | 0.378313
6272 | 512 | 0.100378 | 0.358066 | 0.099857 | 0.395982
200 | 1024 | 0.072954 | 0.326382 | 0.073899 | 0.333007
1000 | 1024 | 0.0743 | 0.325532 | 0.071126 | 0.330991
6000 | 1024 | 0.127025 | 0.390084 | 0.128692 | 0.471504
6272 | 1024 | 0.130704 | 0.403536 | 0.135244 | 0.487133
200 | 1536 | 0.070331 | 0.339169 | 0.070086 | 0.331015
1000 | 1536 | 0.075085 | 0.330042 | 0.076295 | 0.328778
6000 | 1536 | 0.148889 | 0.44949 | 0.155781 | 0.659987
6272 | 1536 | 0.154939 | 0.478871 | 0.17673 | 0.716025
200 | 2048 | 0.070269 | 0.335585 | 0.072804 | 0.334655
1000 | 2048 | 0.080094 | 0.326991 | 0.080426 | 0.32685
6000 | 2048 | 0.187888 | 0.623023 | 0.245762 | 0.981635
6272 | 2048 | 0.195431 | 0.65244 | 0.262574 | 1.008141
200 | 3072 | 0.068205 | 0.339428 | 0.073068 | 0.344034
1000 | 3072 | 0.087554 | 0.328899 | 0.09218 | 0.346433
6000 | 3072 | 0.240352 | 0.905058 | 0.368135 | 1.280462
6272 | 3072 | 0.26179 | 0.959387 | 0.387782 | 1.476524
128 | 2097152 | 5.905976 | 22.724793 | 10.287974 | 30.242092
256 | 1048576 | 4.561596 | 19.554308 | 10.223171 | 29.42371
512 | 524288 | 4.146751 | 22.7247 | 11.404285 | 39.175902
1024 | 262144 | 5.193135 | 23.403325 | 11.334512 | 38.947192
2048 | 131072 | 4.992907 | 23.377801 | 11.400286 | 40.889191
4096 | 65536 | 5.429488 | 24.275701 | 11.196778 | 41.4751
8192 | 32768 | 5.35758 | 21.360312 | 10.535418 | 42.875646
16384 | 16384 | 5.44947 | 20.852605 | 10.357685 | 34.603408
32768 | 8192 | 4.688925 | 17.379392 | 9.635596 | 31.188271


---------
**At this PR**

M | N | fwd (half) | fwdbwd (half) | fwd (float) | fwdbwd (float)
-- | -- | -- | -- | -- | --
50432 | 384 | 0.38797 | 0.93103 | 0.37966 | 1.15283
50176 | 384 | 0.3874 | 0.96417 | 0.38462 | 1.18595
200704 | 192 | 1.00002 | 2.40876 | 0.99224 | 2.55579
802816 | 64 | 3.67348 | 7.98658 | 3.61871 | 7.72404
200 | 256 | 0.07292 | 0.35119 | 0.07195 | 0.32602
1000 | 256 | 0.07354 | 0.33325 | 0.07237 | 0.33742
6000 | 256 | 0.08819 | 0.33283 | 0.08453 | 0.3279
6272 | 256 | 0.0886 | 0.33446 | 0.08774 | 0.33426
200 | 512 | 0.0701 | 0.33505 | 0.07072 | 0.33018
1000 | 512 | 0.07042 | 0.33442 | 0.074 | 0.33206
6000 | 512 | 0.09931 | 0.34956 | 0.09895 | 0.3572
6272 | 512 | 0.10103 | 0.32976 | 0.10041 | 0.36635
200 | 1024 | 0.07144 | 0.33579 | 0.07209 | 0.33216
1000 | 1024 | 0.0736 | 0.32803 | 0.07286 | 0.32936
6000 | 1024 | 0.12584 | 0.38916 | 0.12852 | 0.48273
6272 | 1024 | 0.13053 | 0.38804 | 0.13464 | 0.49545
200 | 1536 | 0.07159 | 0.3396 | 0.07062 | 0.33545
1000 | 1536 | 0.07443 | 0.33239 | 0.07366 | 0.33204
6000 | 1536 | 0.14959 | 0.45043 | 0.15826 | 0.69119
6272 | 1536 | 0.1542 | 0.47644 | 0.18249 | 0.72208
200 | 2048 | 0.07258 | 0.33982 | 0.07412 | 0.33859
1000 | 2048 | 0.0793 | 0.32816 | 0.07864 | 0.32583
6000 | 2048 | 0.18973 | 0.571 | 0.25506 | 0.91796
6272 | 2048 | 0.19719 | 0.64208 | 0.26445 | 0.95055
200 | 3072 | 0.07092 | 0.33867 | 0.07104 | 0.34695
1000 | 3072 | 0.08727 | 0.33144 | 0.09144 | 0.36633
6000 | 3072 | 0.24683 | 0.87275 | 0.37761 | 1.3289
6272 | 3072 | 0.26437 | 0.91178 | 0.38496 | 1.53694
128 | 2097152 | 6.27936 | 23.69425 | 10.40004 | 30.13699
256 | 1048576 | 4.5404 | 19.47675 | 10.28494 | 29.36936
512 | 524288 | 4.13951 | 18.78771 | 10.09557 | 32.67083
1024 | 262144 | 4.47576 | 18.00411 | 9.56488 | 31.47117
2048 | 131072 | 4.28026 | 16.95619 | 9.40297 | 30.82845
4096 | 65536 | 4.2653 | 16.5018 | 9.03315 | 30.08392
8192 | 32768 | 4.25613 | 16.13583 | 8.9258 | 30.75296
16384 | 16384 | 4.20256 | 16.38207 | 9.52587 | 31.31113
32768 | 8192 | 4.20231 | 16.19452 | 9.31478 | 31.03514


---------

**Performance Improvement (%)**

M | N | fwdbwd,   torch.float16 | fwdbwd,   torch.float32
-- | -- | -- | --
50432 | 384 | 32.178 | 22.049
50176 | 384 | 29.231 | 19.536
200704 | 192 | 44.188 | 43.962
802816 | 64 | 52.119 | 54.100
200 | 256 | -5.750 | -0.206
1000 | 256 | 0.031 | -0.797
6000 | 256 | 3.566 | 5.621
6272 | 256 | 3.865 | 4.836
200 | 512 | -1.615 | -1.010
1000 | 512 | -1.270 | 0.208
6000 | 512 | 3.534 | 5.581
6272 | 512 | 7.905 | 7.483
200 | 1024 | -2.883 | 0.254
1000 | 1024 | -0.767 | 0.493
6000 | 1024 | 0.237 | -2.381
6272 | 1024 | 3.840 | -1.707
200 | 1536 | -0.127 | -1.340
1000 | 1536 | -0.711 | -0.992
6000 | 1536 | -0.209 | -4.728
6272 | 1536 | 0.508 | -0.846
200 | 2048 | -1.262 | -1.176
1000 | 2048 | -0.358 | 0.312
6000 | 2048 | 8.350 | 6.487
6272 | 2048 | 1.588 | 5.713
200 | 3072 | 0.223 | -0.848
1000 | 3072 | -0.773 | -5.743
6000 | 3072 | 3.570 | -3.783
6272 | 3072 | 4.962 | -4.092
128 | 2097152 | -4.266 | 0.348
256 | 1048576 | 0.397 | 0.185
512 | 524288 | 17.325 | 16.605
1024 | 262144 | 23.070 | 19.195
2048 | 131072 | 27.469 | 24.605
4096 | 65536 | 32.023 | 27.465
8192 | 32768 | 24.459 | 28.274
16384 | 16384 | 21.439 | 9.514
32768 | 8192 | 6.818 | 0.491


---------
**Benchmark script of this PR**
```
# Ref:
#       1. pytorch#26201
#       2. pytorch#68238

import torch
from torch.nn import LayerNorm
import timeit

number_runs = 1000  # TODO: Modify this to save time!
def test_forward(layer_norm_cuda, input_cuda):
    layer_norm_cuda(input_cuda); torch.cuda.synchronize()

def test_backward(out_cuda, layer_norm_grad_cuda, create_graph):
    out_cuda.backward(layer_norm_grad_cuda, retain_graph=True, create_graph=create_graph); torch.cuda.synchronize()

def test_fwdbwd(input_cuda, layer_norm_cuda, gO):
    input_cuda.grad = None
    layer_norm_cuda.zero_grad(set_to_none=True)
    out = layer_norm_cuda(input_cuda)
    out.backward(gO)
    torch.cuda.synchronize()

def benchmark(config_m, config_n):

    print("M | N | fwd (half) | fwdbwd (half) | fwd (float) | fwdbwd (float)")
    if len(config_m) != len(config_n):
        print("Please make sure the lengths of config_m and config_m are the same.")

    for i in range(len(config_m)):
        normalized_shape = config_n[i]
        results = [config_m[i], config_n[i]]
        for dtype in (torch.half, torch.float):
            if dtype == torch.half:
                layer_norm_cuda = LayerNorm(normalized_shape).half().cuda()
            else:
                layer_norm_cuda = LayerNorm(normalized_shape).cuda()

            input_cuda = torch.randn(config_m[i], config_n[i], device='cuda', dtype=dtype, requires_grad=True)

            # print("cuda forward:")
            result_fwd = timeit.timeit(lambda: test_forward(layer_norm_cuda, input_cuda), number=number_runs)
            results.append(result_fwd / number_runs * 1000)

            gO = torch.rand_like(input_cuda)

            result_fwdbwd = timeit.timeit(lambda: test_fwdbwd(input_cuda, layer_norm_cuda, gO), number=number_runs)
            results.append(result_fwdbwd / number_runs * 1000)

        print('{:09d}|{:09d}|{:9.5f}|{:9.5f}|{:9.5f}|{:9.5f}'.format(results[0], results[1], results[2], results[3], results[4], results[5]))

    print("Times are in microseconds (us).")

# CVT
config_m_cvt = [50432, 50176, 200704, 802816]
config_n_cvt = [384, 384, 192, 64]

# pytorch#68238 (comment)
config_m_68238 = [200, 1000, 6000, 6272, 200, 1000, 6000, 6272, 200, 1000, 6000, 6272, 200, 1000, 6000, 6272, 200, 1000, 6000, 6272, 200, 1000, 6000, 6272]
config_n_68238 = [256,256,256,256,512,512,512,512,1024,1024,1024,1024,1536,1536,1536,1536,2048,2048,2048,2048,3072,3072,3072,3072]

# pytorch#27634
config_m_27634 = [128, 256, 512, 1024, 2048, 4096, 8192, 16384, 32768]
config_n_27634 = [2097152, 1048576, 524288, 262144, 131072, 65536, 32768, 16384, 8192]

config_m = config_m_cvt + config_m_68238 + config_m_27634
config_n = config_n_cvt + config_n_68238 + config_n_27634

benchmark(config_m, config_n)
```

CC: @jeffdaily

Pull Request resolved: pytorch#87635
Approved by: https://github.com/jataylo, https://github.com/jeffdaily, https://github.com/ezyang
Both quantization and non-quantization scenarios are supported.