
Conversation

Guobing-Chen
Owner

This PR implements maxpool2d for NNC and is another follow-up PR to enable quantization/channels-last support at the op level.

  • maxpool2d NNC lowering function implementation and lowering-path enablement:
torch/csrc/jit/tensorexpr/operators/reduction.cpp
torch/csrc/jit/tensorexpr/operators/reduction.h
torch/csrc/jit/tensorexpr/lowerings.cpp
  • maxpool2d NNC external call implementations for both the default and out versions:
torch/csrc/jit/tensorexpr/external_functions.cpp
torch/csrc/jit/tensorexpr/codegen.cpp
  • Added test cases covering quantization and non-quantization scenarios:
test/cpp/tensorexpr/test_quantization.cpp
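
As a rough illustration of how the new lowering can be exercised from Python (a minimal sketch; the real coverage lives in test_quantization.cpp, and the fuser toggles below are internal APIs):

```python
import torch

# Enable the NNC (TensorExpr) fuser on CPU so max_pool2d can take the new lowering.
torch._C._jit_set_texpr_fuser_enabled(True)
torch._C._jit_override_can_fuse_on_cpu(True)

@torch.jit.script
def f(x):
    return torch.max_pool2d(torch.relu(x), kernel_size=[2, 2])

# Channels-last input, matching the channels-last scenario this PR targets.
x = torch.randn(1, 3, 16, 16).to(memory_format=torch.channels_last)
for _ in range(3):  # warm up so the profiling executor specializes and fuses
    f(x)
# A TensorExprGroup node in the optimized graph indicates NNC picked it up.
print(torch.jit.last_executed_optimized_graph())
```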

@Guobing-Chen force-pushed the nnc_quant_op_maxpool2d branch 2 times, most recently from 29c55d7 to e917fd4 on August 26, 2022 02:36
ezyang and others added 23 commits November 16, 2022 01:08
This reduces boilerplate.  Also, I plan to add a template
parameter to ConvParams; without moving the methods onto the
struct, I would have to manually template every method.

Signed-off-by: Edward Z. Yang <[email protected]>

Pull Request resolved: pytorch#89062
Approved by: https://github.com/SherlockNoMad
…tor (pytorch#88859)"

This reverts commit d60abe4.

Reverted pytorch#88859 on behalf of https://github.com/kit1980 due to breaking macOS testing, as clearly shown in CI
Now that periodic jobs run under `mem_leak_check` mode with parallelization turned off, it's very easy for `linux-bionic-cuda11.6-py3-gcc7-slow-gradcheck / test` to time out because one of the shards is very close to the 4-hour mark.

* https://hud.pytorch.org/pytorch/pytorch/commit/2452e3f99a072760fc46d3f9025aaa37ca7ea2ab
* https://hud.pytorch.org/pytorch/pytorch/commit/35e668b5ced25e735b6e523d557ed7fd60267914

Pull Request resolved: pytorch#89079
Approved by: https://github.com/clee2000
…ytorch#89066)

This adds a unit test following the FSDP change in pytorch#88781.
Pull Request resolved: pytorch#89066
Approved by: https://github.com/fegin
… call (pytorch#89029)

# Summary
Creates a callable native function that can determine which implementation of scaled dot product attention will get called. This makes it possible to reorder the runtime dispatch of SDP to enable autograd.
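
For context, a minimal sketch of how the backend choice surfaces at the Python level (the `sdp_kernel` context manager and `scaled_dot_product_attention` names are assumptions based on the API that later became public, not this PR's C++ entry point):

```python
import torch
import torch.nn.functional as F

q, k, v = (torch.randn(2, 8, 128, 64, device="cuda", dtype=torch.float16)
           for _ in range(3))

# Restrict the dispatcher to one backend to observe which implementation runs.
with torch.backends.cuda.sdp_kernel(enable_flash=True,
                                    enable_math=False,
                                    enable_mem_efficient=False):
    out = F.scaled_dot_product_attention(q, k, v)
```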
Pull Request resolved: pytorch#89029
Approved by: https://github.com/cpuhrsch
…85624)

When building products using PyTorch, it is often required to display license terms for all dependencies.
The feature itself was implemented in pytorch#81500, but there was no option to enable it.
This PR implements that option.

cc/ @mattip @rgommers
Pull Request resolved: pytorch#85624
Approved by: https://github.com/rgommers, https://github.com/seemethere
Summary: Support shape padding for aten.mm in Inductor (originally from [pytorch#88709](pytorch#88709))
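
The idea behind shape padding, as a hedged sketch (the `pad_dim`/`padded_mm` helpers below are hypothetical, not Inductor's actual pass): pad the operands of `aten.mm` up to hardware-friendly multiples, run the matmul, then slice the result back.

```python
import torch
import torch.nn.functional as F

def pad_dim(t, dim, multiple=8):
    """Zero-pad `t` along `dim` up to the next multiple (hypothetical helper)."""
    extra = -t.size(dim) % multiple
    if extra == 0:
        return t
    pad = [0, 0] * t.dim()
    # F.pad's pad list runs from the last dimension backwards.
    pad[2 * (t.dim() - 1 - dim) + 1] = extra
    return F.pad(t, pad)

def padded_mm(a, b, multiple=8):
    m, k = a.shape
    n = b.shape[1]
    a = pad_dim(pad_dim(a, 0, multiple), 1, multiple)
    b = pad_dim(pad_dim(b, 0, multiple), 1, multiple)
    # Zero padding contributes nothing to the valid region, so slicing is exact.
    return torch.mm(a, b)[:m, :n]

out = padded_mm(torch.randn(123, 70), torch.randn(70, 45))
assert out.shape == (123, 45)
```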

Differential Revision: D41315078

Pull Request resolved: pytorch#89086
Approved by: https://github.com/jianyuh
Inductor test report artifacts are now on HUD, but the files are in CSV format instead of the XML files from pytest or unittest that we expect by default, so this PR uploads both suffixes.

Pull Request resolved: pytorch#89112
Approved by: https://github.com/desertfire
…ytorch#88549)

This PR creates the `torch.distributed._tensor` package and moves
DeviceMesh and the placement types to it.

part of pytorch#88838
Pull Request resolved: pytorch#88549
Approved by: https://github.com/fduwjj
…ed (pytorch#88176)

This PR moves the core DTensor abstraction and high level APIs to
torch.distributed._tensor folder, which includes the following:
1. DTensor class
2. high level APIs (distribute_tensor/module)
3. dispatching logic
4. redistribute logic
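
A minimal usage sketch of the APIs listed above (assuming the import paths as of this move; typically launched with torchrun so every rank executes the script):

```python
import torch
import torch.distributed as dist
from torch.distributed._tensor import DeviceMesh, Shard, distribute_tensor

dist.init_process_group("gloo")
mesh = DeviceMesh("cpu", list(range(dist.get_world_size())))

big = torch.randn(8, 8)
# Shard dim 0 across the mesh; each rank materializes only its local slice.
dtensor = distribute_tensor(big, mesh, placements=[Shard(0)])
print(dist.get_rank(), dtensor.to_local().shape)
```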

part of pytorch#88838
Pull Request resolved: pytorch#88176
Approved by: https://github.com/fduwjj
…88177)

This PR moves most DTensor ops to torch.distributed._tensor. We will
add all tests in the following PRs.

part of pytorch#88838
Pull Request resolved: pytorch#88177
Approved by: https://github.com/fduwjj
…orch#88550)

This PR moves the view-related DTensor ops to core distributed;
tests will be added in follow-up PRs.

part of pytorch#88838
Pull Request resolved: pytorch#88550
Approved by: https://github.com/fduwjj
…ch#88178)

This PR moves the basic DTensor tests to torch.distributed, including
the dtensor and device_mesh tests.

part of pytorch#88838
Pull Request resolved: pytorch#88178
Approved by: https://github.com/fduwjj
…88551)

This PR moves DTensor op tests to core distributed, including
prop_rule, pointwise op, matrix op tests, etc.

part of pytorch#88838
Pull Request resolved: pytorch#88551
Approved by: https://github.com/aazzolini
…ytorch#88179)

This PR moves remaining tests, i.e. tensor_ops, op db tests to core distributed

part of pytorch#88838
Pull Request resolved: pytorch#88179
Approved by: https://github.com/aazzolini
…ted (pytorch#88180)

This PR moves tensor/parallel folder and tests to torch.distributed.

part of pytorch#88838
Pull Request resolved: pytorch#88180
Approved by: https://github.com/aazzolini
…orch#89118)

Summary:
Build fx-based linear/matmul/bmm + permute/transpose vertical fusion in Inductor

For an internal Ads model: **1.15x -> 1.36x speedup**

Test Plan: CI

Reviewed By: bertmaher, jansel, jianyuh

Differential Revision: D41071665

Pull Request resolved: pytorch#89118
Approved by: https://github.com/jianyuh
huydhn and others added 29 commits November 22, 2022 03:39
…ed-tests mode (pytorch#89454)

When looking into the Rockset data for a disabled unittest, for example `testAdd`, I see that it's re-run only 3 times instead of the 50+ times expected under rerun-disabled-tests mode

```
[
  {
    "name": "testAdd",
    "classname": "TestLazyReuseIr",
    "filename": "lazy/test_reuse_ir.py",
    "flaky": false,
    "num_green": 3,
    "num_red": 0
  }
]
```

It turns out that I made a mistake mixing `RERUN_DISABLED_TESTS` and `report_only` into `(RERUN_DISABLED_TESTS or report_only) and num_retries_left < MAX_NUM_RETRIES` in pytorch#88646.  The retry logic for successful tests under rerun-disabled-tests mode was never executed, because num_retries_left would be equal to MAX_NUM_RETRIES (not smaller) if the very first run succeeds. Thus, the sample test `testAdd` finishes right away (1 success count)

* `report_only` and `RERUN_DISABLED_TESTS` are two different things and shouldn't be mixed together; `RERUN_DISABLED_TESTS` has the higher priority (see the sketch below).
* We also don't want to retry skipped tests under rerun-disabled-tests mode, because they are only skipped due to the `check_if_enable` check: `Test is enabled but --rerun-disabled-tests verification mode is set, so only disabled tests are run`
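
A minimal sketch of the separated decision logic (illustrative only, not the actual `torch.testing._internal` code):

```python
MAX_NUM_RETRIES = 50

def should_retry(success: bool, rerun_disabled_tests: bool,
                 report_only: bool, num_retries_left: int) -> bool:
    if rerun_disabled_tests:
        # Rerun-disabled-tests mode has priority: keep rerunning until the
        # budget is spent, even when the very first attempt succeeds.
        return num_retries_left > 0
    if report_only:
        # Report-only mode retries failures within the budget.
        return (not success) and num_retries_left > 0
    return not success
```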

### Testing

* CI https://github.com/pytorch/pytorch/actions/runs/3518228784 generates https://gha-artifacts.s3.amazonaws.com/pytorch/pytorch/3518228784/1/artifact/test-reports-test-default-4-4-linux.4xlarge.nvidia.gpu_9627285587.zip in which `testAdd` is correctly called multiple times and `TestLazyReuseIr` is skipped correctly
* Locally

```
# export CI=1
# export PYTORCH_RETRY_TEST_CASES=1
# export PYTORCH_OVERRIDE_FLAKY_SIGNAL=1
# export PYTORCH_TEST_RERUN_DISABLED_TESTS=1
$ python test/run_test.py --verbose -i lazy/test_reuse_ir
Ignoring disabled issues:  []
Selected tests:
 lazy/test_reuse_ir
Prioritized test from test file changes.
reordering tests for PR:
prioritized: []
the rest: ['lazy/test_reuse_ir']

Downloading https://raw.githubusercontent.com/pytorch/test-infra/generated-stats/stats/slow-tests.json to /Users/huydo/Storage/mine/pytorch/test/.pytorch-slow-tests.json
Downloading https://raw.githubusercontent.com/pytorch/test-infra/generated-stats/stats/disabled-tests-condensed.json to /Users/huydo/Storage/mine/pytorch/test/.pytorch-disabled-tests.json
parallel (file granularity) tests:
 lazy/test_reuse_ir
serial (file granularity) tests:

Ignoring disabled issues:  []
Ignoring disabled issues:  []
Running lazy/test_reuse_ir ... [2022-11-21 13:21:07.165877]
Executing ['/Users/huydo/miniconda3/envs/py3.9/bin/python', '-bb', 'lazy/test_reuse_ir.py', '-v', '--import-slow-tests', '--import-disabled-tests', '--rerun-disabled-tests'] ... [2022-11-21 13:21:07.166279]

Expand the folded group to see the log file of lazy/test_reuse_ir
##[group]PRINTING LOG FILE of lazy/test_reuse_ir (/Users/huydo/Storage/mine/pytorch/test/test-reports/lazy-test_reuse_ir_6cf_dxa1)

Running tests...
----------------------------------------------------------------------
Test results will be stored in test-reports/python-unittest/lazy.test_reuse_ir
  testAdd (__main__.TestLazyReuseIr) ... ok (1.215s)
  testAdd (__main__.TestLazyReuseIr) ...     testAdd succeeded - num_retries_left: 50
ok (0.001s)
  testAdd (__main__.TestLazyReuseIr) ...     testAdd succeeded - num_retries_left: 49
ok (0.001s)
  testAdd (__main__.TestLazyReuseIr) ...     testAdd succeeded - num_retries_left: 48
ok (0.001s)
  testAdd (__main__.TestLazyReuseIr) ...     testAdd succeeded - num_retries_left: 47
ok (0.001s)
  testAdd (__main__.TestLazyReuseIr) ...     testAdd succeeded - num_retries_left: 46
ok (0.001s)
  testAdd (__main__.TestLazyReuseIr) ...     testAdd succeeded - num_retries_left: 45
ok (0.001s)
  testAdd (__main__.TestLazyReuseIr) ...     testAdd succeeded - num_retries_left: 44
ok (0.001s)
  testAdd (__main__.TestLazyReuseIr) ...     testAdd succeeded - num_retries_left: 43
ok (0.001s)
  testAdd (__main__.TestLazyReuseIr) ...     testAdd succeeded - num_retries_left: 42
ok (0.001s)
  testAdd (__main__.TestLazyReuseIr) ...     testAdd succeeded - num_retries_left: 41
ok (0.001s)
  testAdd (__main__.TestLazyReuseIr) ...     testAdd succeeded - num_retries_left: 40
ok (0.001s)
  testAdd (__main__.TestLazyReuseIr) ...     testAdd succeeded - num_retries_left: 39
ok (0.001s)
  testAdd (__main__.TestLazyReuseIr) ...     testAdd succeeded - num_retries_left: 38
ok (0.001s)
  testAdd (__main__.TestLazyReuseIr) ...     testAdd succeeded - num_retries_left: 37
ok (0.001s)
  testAdd (__main__.TestLazyReuseIr) ...     testAdd succeeded - num_retries_left: 36
ok (0.001s)
  testAdd (__main__.TestLazyReuseIr) ...     testAdd succeeded - num_retries_left: 35
ok (0.001s)
  testAdd (__main__.TestLazyReuseIr) ...     testAdd succeeded - num_retries_left: 34
ok (0.001s)
  testAdd (__main__.TestLazyReuseIr) ...     testAdd succeeded - num_retries_left: 33
ok (0.001s)
  testAdd (__main__.TestLazyReuseIr) ...     testAdd succeeded - num_retries_left: 32
ok (0.001s)
  testAdd (__main__.TestLazyReuseIr) ...     testAdd succeeded - num_retries_left: 31
ok (0.001s)
  testAdd (__main__.TestLazyReuseIr) ...     testAdd succeeded - num_retries_left: 30
ok (0.001s)
  testAdd (__main__.TestLazyReuseIr) ...     testAdd succeeded - num_retries_left: 29
ok (0.001s)
  testAdd (__main__.TestLazyReuseIr) ...     testAdd succeeded - num_retries_left: 28
ok (0.001s)
  testAdd (__main__.TestLazyReuseIr) ...     testAdd succeeded - num_retries_left: 27
ok (0.001s)
  testAdd (__main__.TestLazyReuseIr) ...     testAdd succeeded - num_retries_left: 26
ok (0.001s)
  testAdd (__main__.TestLazyReuseIr) ...     testAdd succeeded - num_retries_left: 25
ok (0.001s)
  testAdd (__main__.TestLazyReuseIr) ...     testAdd succeeded - num_retries_left: 24
ok (0.001s)
  testAdd (__main__.TestLazyReuseIr) ...     testAdd succeeded - num_retries_left: 23
ok (0.001s)
  testAdd (__main__.TestLazyReuseIr) ...     testAdd succeeded - num_retries_left: 22
ok (0.001s)
  testAdd (__main__.TestLazyReuseIr) ...     testAdd succeeded - num_retries_left: 21
ok (0.001s)
  testAdd (__main__.TestLazyReuseIr) ...     testAdd succeeded - num_retries_left: 20
ok (0.001s)
  testAdd (__main__.TestLazyReuseIr) ...     testAdd succeeded - num_retries_left: 19
ok (0.001s)
  testAdd (__main__.TestLazyReuseIr) ...     testAdd succeeded - num_retries_left: 18
ok (0.001s)
  testAdd (__main__.TestLazyReuseIr) ...     testAdd succeeded - num_retries_left: 17
ok (0.001s)
  testAdd (__main__.TestLazyReuseIr) ...     testAdd succeeded - num_retries_left: 16
ok (0.001s)
  testAdd (__main__.TestLazyReuseIr) ...     testAdd succeeded - num_retries_left: 15
ok (0.001s)
  testAdd (__main__.TestLazyReuseIr) ...     testAdd succeeded - num_retries_left: 14
ok (0.001s)
  testAdd (__main__.TestLazyReuseIr) ...     testAdd succeeded - num_retries_left: 13
ok (0.001s)
  testAdd (__main__.TestLazyReuseIr) ...     testAdd succeeded - num_retries_left: 12
ok (0.001s)
  testAdd (__main__.TestLazyReuseIr) ...     testAdd succeeded - num_retries_left: 11
ok (0.001s)
  testAdd (__main__.TestLazyReuseIr) ...     testAdd succeeded - num_retries_left: 10
ok (0.001s)
  testAdd (__main__.TestLazyReuseIr) ...     testAdd succeeded - num_retries_left: 9
ok (0.001s)
  testAdd (__main__.TestLazyReuseIr) ...     testAdd succeeded - num_retries_left: 8
ok (0.001s)
  testAdd (__main__.TestLazyReuseIr) ...     testAdd succeeded - num_retries_left: 7
ok (0.001s)
  testAdd (__main__.TestLazyReuseIr) ...     testAdd succeeded - num_retries_left: 6
ok (0.001s)
  testAdd (__main__.TestLazyReuseIr) ...     testAdd succeeded - num_retries_left: 5
ok (0.001s)
  testAdd (__main__.TestLazyReuseIr) ...     testAdd succeeded - num_retries_left: 4
ok (0.001s)
  testAdd (__main__.TestLazyReuseIr) ...     testAdd succeeded - num_retries_left: 3
ok (0.001s)
  testAdd (__main__.TestLazyReuseIr) ...     testAdd succeeded - num_retries_left: 2
ok (0.001s)
  testAdd (__main__.TestLazyReuseIr) ...     testAdd succeeded - num_retries_left: 1
ok (0.001s)
  testAddSub (__main__.TestLazyReuseIr) ...     testAdd succeeded - num_retries_left: 0
skip: Test is enabled but --rerun-disabled-tests verification mode is set, so only disabled tests are run (0.001s)
  testAddSubFallback (__main__.TestLazyReuseIr) ... skip: Test is enabled but --rerun-disabled-tests verification mode is set, so only disabled tests are run (0.001s)
  testBatchNorm (__main__.TestLazyReuseIr) ... skip: Test is enabled but --rerun-disabled-tests verification mode is set, so only disabled tests are run (0.001s)

----------------------------------------------------------------------
Ran 54 tests in 1.264s

OK (skipped=3)
```

Here is the sample Rockset query

```
WITH added_row_number AS (
  SELECT
    *,
    ROW_NUMBER() OVER(PARTITION BY name, classname, filename ORDER BY _event_time DESC) AS row_number
  FROM
    commons.rerun_disabled_tests
)
SELECT
  name,
  classname,
  filename,
  flaky,
  num_green,
  num_red
FROM
  added_row_number
WHERE
  row_number = 1
  AND name = 'testAdd'
```
Pull Request resolved: pytorch#89454
Approved by: https://github.com/clee2000
… core distributed (pytorch#89399)

This PR moves dedup_tensors and its test to torch.distributed.checkpoint. This is a pre-req for enabling 2D checkpoint.

This removes duplicated shards in the list of SavePlans. It is used when saving a DTensor with replicated placement.

Docstring and comments will be added in the following PRs.
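
A rough sketch of the deduplication idea (hypothetical data shapes; the real code operates on `SavePlan`/`WriteItem` objects):

```python
from typing import List, Tuple

def dedup_plans(plans: List[List[Tuple[str, object]]]) -> List[List[Tuple[str, object]]]:
    # One plan per rank; with replicated placements the same shard key appears
    # in several ranks' plans, and only the first occurrence should be written.
    seen = set()
    deduped = []
    for plan in plans:
        kept = []
        for key, item in plan:
            if key not in seen:
                seen.add(key)
                kept.append((key, item))
        deduped.append(kept)
    return deduped
```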
Pull Request resolved: pytorch#89399
Approved by: https://github.com/wanchaol
…88904)

In pytorch#87741 we added inference support for the dynamo/torchxla integration. Later, in pytorch#88449, we attempted to add training support. That attempt did not go smoothly because
- we tried 2 things together:
   1. let dynamo trace the model on XLA rather than eager
   2. enable training
- it turns out neither of these two tasks is trivial.

Furthermore, item 2 (enabling training) depends on item 1 (tracing on XLA). We enable training via AOTAutograd. AOTAutograd lifts all model parameters/buffers as graph inputs. Without item 1 being done, we would need to copy all graph inputs (including model parameters/buffers) from the eager device to XLA devices, which hurts performance a lot. Having a cache that maps eager parameters to XLA parameters does not solve the problem, since an update to either side does not sync automatically to the other; they easily go out of sync.

This PR let dynamo trace the model on XLA rather than eager. This is a preparation step to enabling training.

Also, tracing on XLA makes the data movement more efficient. We see a 1.5x geomean speedup compared to the previous 1.38x.
```
+-------------------------+--------------------+-------------------------+
| Model                   |   XLA (trace once) |   XLA (trace everytime) |
+=========================+====================+=========================+
| resnet18                |            1.38    |                 1.008   |
+-------------------------+--------------------+-------------------------+
| resnet50                |            1.227   |                 0.998   |
+-------------------------+--------------------+-------------------------+
| resnext50_32x4d         |            1.544   |                 1.008   |
+-------------------------+--------------------+-------------------------+
| alexnet                 |            1.085   |                 1.045   |
+-------------------------+--------------------+-------------------------+
| mobilenet_v2            |            2.028   |                 1.013   |
+-------------------------+--------------------+-------------------------+
| mnasnet1_0              |            1.516   |                 0.995   |
+-------------------------+--------------------+-------------------------+
| squeezenet1_1           |            0.868   |                 1.01    |
+-------------------------+--------------------+-------------------------+
| vgg16                   |            1.099   |                 1.008   |
+-------------------------+--------------------+-------------------------+
| BERT_pytorch            |            3.26    |                 1.027   |
+-------------------------+--------------------+-------------------------+
| timm_vision_transformer |            2.182   |                 1.015   |
+-------------------------+--------------------+-------------------------+
| geomean                 |            1.50389 |                 1.01261 |
+-------------------------+--------------------+-------------------------+
```

Example command
```
GPU_NUM_DEVICES=1 python benchmarks/dynamo/torchbench.py --randomize-input --performance --trace-on-xla --only resnet18 --backend=torchxla_trace_once
```

Pull Request resolved: pytorch#88904
Approved by: https://github.com/wconstab, https://github.com/JackCaoG, https://github.com/jansel
…rch#89463)

Summary: This permute copy change seems to be causing huge regressions on machines without AVX512. Revert to mitigate. This shouldn't be problematic since the improvement from the change was very small anyway.

Differential Revision: D41450088

Pull Request resolved: pytorch#89463
Approved by: https://github.com/hlu1
… distributed (pytorch#89398)

This PR moves traverse and its test to torch.distributed.checkpoint. This is a pre-req for enabling 2D checkpoint.

This is used when flattening nested dicts and flattening sharded tensors.

Docstring and comments will be added in the following PRs.
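
A minimal sketch of the traversal idea (illustrative; the real helper in torch.distributed.checkpoint also understands sharded tensors):

```python
from typing import Any, Callable, Tuple

def traverse_state_dict(obj: Any, visitor: Callable[[Tuple, Any], None],
                        path: Tuple = ()) -> None:
    # Recurse through nested dicts/lists and call `visitor` on each leaf
    # with its flattened path, e.g. ("model", "layer1", "weight").
    if isinstance(obj, dict):
        for k, v in obj.items():
            traverse_state_dict(v, visitor, path + (k,))
    elif isinstance(obj, (list, tuple)):
        for i, v in enumerate(obj):
            traverse_state_dict(v, visitor, path + (i,))
    else:
        visitor(path, obj)

traverse_state_dict({"model": {"w": 1, "b": [2, 3]}}, lambda p, v: print(p, v))
```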

Test:
```
python3 test/distributed/_tensor/parallel/test_2d_parallel.py
```
and CI
Pull Request resolved: pytorch#89398
Approved by: https://github.com/wanchaol
Summary: Fix rounding issue in quantized shaders

Test Plan:
On Mac
```
cd ~/fbsource
buck1 run -c pt.vulkan_full_precision=1 //xplat/caffe2:pt_vulkan_quantized_api_test_binAppleMac\#macosx-arm64
```

On Android
```
cd ~/fbsource
buck1 build -c ndk.custom_libcxx=false -c pt.enable_qpl=0 -c pt.vulkan_full_precision=1 //xplat/caffe2:pt_vulkan_quantized_api_test_binAndroid\#android-arm64 --show-output
adb push buck-out/gen/xplat/caffe2/pt_vulkan_quantized_api_test_binAndroid\#android-arm64 /data/local/tmp/vulkan_quantized_api_test
adb shell "/data/local/tmp/vulkan_quantized_api_test"
```

Reviewed By: salilsdesai

Differential Revision: D41047095

Pull Request resolved: pytorch#89456
Approved by: https://github.com/kirklandsign, https://github.com/digantdesai
…c quantization (pytorch#89248)

Summary:
split the is_decomposed logic for `_replace_observer_with_quantize_dequantize_node` into a separate function and added support for dynamic quantization in the decomposed version of this function.

In case of dynamic quantization, we'll produce the following reference quantized pattern in decomposed mode:
```
x -> choose_qparams -> quantize_per_tensor -> dequantize_per_tensor -> linear
```
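
A minimal repro sketch of the flow, assuming the FX quantization entry points of that time (`prepare_fx`, `_convert_to_reference_decomposed_fx`); treat the exact imports as approximate:

```python
import torch
from torch.ao.quantization import QConfigMapping, default_dynamic_qconfig
from torch.ao.quantization.quantize_fx import prepare_fx, _convert_to_reference_decomposed_fx

class M(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.linear = torch.nn.Linear(4, 4)

    def forward(self, x):
        return self.linear(x)

example_inputs = (torch.randn(1, 4),)
qconfig_mapping = QConfigMapping().set_object_type(torch.nn.Linear, default_dynamic_qconfig)
m = prepare_fx(M().eval(), qconfig_mapping, example_inputs)
m = _convert_to_reference_decomposed_fx(m)
# The printed graph should contain choose_qparams / quantize_per_tensor /
# dequantize_per_tensor feeding the linear, as described above.
print(m.graph)
```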

Test Plan:
python test/test_quantization.py -k test__convert_to_reference_decomposed_fx_dynamic_quant

Reviewers:

Subscribers:

Tasks:

Tags:

Pull Request resolved: pytorch#89248
Approved by: https://github.com/vkuzo
Fixes - T137631262

Caching conda dependencies for build workflows.
Conda dependencies have been gathered from the workflow https://github.com/pytorch/pytorch/blob/master/.github/workflows/_buck-build-test.yml

The pull request updates the action from `conda-incubator/setup-miniconda@v2` to `pytorch/test-infra/.github/actions/setup-miniconda@main` as it supports caching.

Test Plan:

Running `ciflow/periodic`, which runs the `buck-build-test` CI workflow. The expected output is to have all the conda dependencies cached.

<img width="1227" alt="Screenshot 2022-11-22 at 15 44 20" src="https://user-images.githubusercontent.com/15447437/203343298-e55c384b-01ad-45c3-a5e9-ba5c53149be4.png">

Pull Request resolved: pytorch#89422
Approved by: https://github.com/huydhn
…orch#88089)

This fixes some prod and masked.prod tests on Windows.

np.prod uses int32 on Windows, so it overflows.

On Linux it uses int64 by default.
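
A quick illustration of the overflow, forcing int32 to mimic the Windows default:

```python
import numpy as np

a = np.array([100000, 100000], dtype=np.int32)
print(np.prod(a))                  # 1410065408 — wrapped around int32
print(np.prod(a, dtype=np.int64))  # 10000000000 — the correct product
```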

Fixes pytorch#77305
Fixes pytorch#77320
Fixes pytorch#77334
Fixes pytorch#77335
Fixes pytorch#77336
Fixes pytorch#77337

Pull Request resolved: pytorch#88089
Approved by: https://github.com/mruberry
Enables previously failing UCC distributed_test.py tests that are now fixed by either the ProcessGroupUCC barrier blocking fix (pytorch#86961) or the UCC-side timeout error handling fix (https://github.com/openucx/ucc/pull/679/files). Bumps the upstream UCC version to build UCC with the timeout error handling fix merged in.

Pull Request resolved: pytorch#89023
Approved by: https://github.com/kwen2501, https://github.com/malfet
The test still fails when run on 5 A100 GPUs, although it works with 5 V100s. Using 4 GPUs seems to be fine.

Followup to pytorch#85957

Pull Request resolved: pytorch#86280
Approved by: https://github.com/awgu, https://github.com/kit1980
The test may fail due to slightly different values caused by a different order of matrices in SGEMM:

> Mismatched elements: 1 / 50 (2.0%)
> Greatest absolute difference: 1.430511474609375e-05 at index (4, 5) (up to 1e-05 allowed)
> Greatest relative difference: 4.65393206065873e-06 at index (4, 5) (up to 1.3e-06 allowed)

Observed on POWER (ppc64le)
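
A typical remedy in such cases is to widen the per-test tolerances, e.g. (illustrative values only, not necessarily the ones this PR chose):

```python
import torch

actual = torch.tensor([1.0000143])   # value produced with a different SGEMM order
expected = torch.tensor([1.0])
# Default float32 tolerances (rtol=1.3e-6, atol=1e-5) would reject this;
# a slightly larger atol absorbs the reordering noise.
torch.testing.assert_close(actual, expected, rtol=1.3e-6, atol=2e-5)
```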

Pull Request resolved: pytorch#86365
Approved by: https://github.com/mruberry, https://github.com/kit1980
Replace the remaining hand-written code in vec256_float_vsx.h with calls to Sleef functions, similar to what was done in pytorch#59382 & pytorch#82646 after pytorch#41541.

This fixes wrong results for e.g. `sin(1e20)`.
Fixes pytorch#85978

To fix pytorch#85978 I only needed to change the sin/cos functions to make the test pass, but to avoid encountering the same issue again and again (see the previous PRs and issues) I checked the whole file for similar functions where a Sleef function could be used and changed those too. While doing that I noticed the faulty whitespace in the diff, so I fixed that as well; the file should now be complete.
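
A quick way to check the failure mode (on an affected ppc64le build the two values diverge; after the fix they agree):

```python
import math
import torch

x = torch.full((8,), 1e20)        # large enough that naive range reduction fails
print(torch.sin(x)[0].item())     # vectorized (VSX) path
print(math.sin(1e20))             # scalar libm reference
```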

Pull Request resolved: pytorch#86453
Approved by: https://github.com/malfet
Add the commit date to the build summary of the dashboard. Make the date of the run reflect when the run started, not when it ended. Use PST (UTC-8) to determine the day, rather than GMT (UTC+0).
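
For the timezone part, the day bucket can be computed along these lines (an illustrative sketch, not the dashboard's actual code):

```python
from datetime import datetime, timedelta, timezone

PST = timezone(timedelta(hours=-8))           # fixed UTC-8
run_started_at = datetime.now(timezone.utc)   # capture at run start, not end
dashboard_day = run_started_at.astimezone(PST).date()
print(dashboard_day)
```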

Test comment: pytorch/torchdynamo#1831 (comment)

Pull Request resolved: pytorch#89517
Approved by: https://github.com/anijain2305
We observed that the native PyTorch LayerNormBackwardKernelImplInternal has suboptimal performance for certain input sizes on AMD GPUs, especially when `fs` (=`config_m` in our benchmark script) is large and `bs` (=`config_n` in our benchmark script) is small, a pattern commonly seen in [the CvT model](https://arxiv.org/abs/2103.15808). We used the benchmark script of [PR pytorch#68238](pytorch#68238 (comment)) on AMD GPUs.

This PR replaces `GammaBetaBackwardCUDAKernel` with the Apex layernorm backward kernel, with some ROCm-specific parameter tuning, when `fs` (=`config_m`) is larger than 512 on AMD GPUs.

There are a few PRs for LayerNorm kernel:
- pytorch#26201
- pytorch#27634
- pytorch#68238

Therefore, we have tested and compared the kernel before and at this PR with the input shapes in the last two PRs along with those commonly used in the CvT model on AMD MI100.

---
**Current**
M | N | fwd (half) | fwdbwd (half) | fwd (float) | fwdbwd (float)
-- | -- | -- | -- | -- | --
50432 | 384 | 0.387256 | 1.372758 | 0.378975 | 1.47892
50176 | 384 | 0.38231 | 1.362416 | 0.378084 | 1.473886
200704 | 192 | 0.997859 | 4.315875 | 0.989306 | 4.560827
802816 | 64 | 3.671828 | 16.68013 | 3.613515 | 16.827946
200 | 256 | 0.066503 | 0.332096 | 0.071422 | 0.325349
1000 | 256 | 0.071848 | 0.333355 | 0.073038 | 0.334753
6000 | 256 | 0.086334 | 0.345139 | 0.086834 | 0.347429
6272 | 256 | 0.088601 | 0.347906 | 0.087855 | 0.351245
200 | 512 | 0.071626 | 0.329726 | 0.073798 | 0.326878
1000 | 512 | 0.073975 | 0.330226 | 0.074166 | 0.332751
6000 | 512 | 0.099617 | 0.362367 | 0.100095 | 0.378313
6272 | 512 | 0.100378 | 0.358066 | 0.099857 | 0.395982
200 | 1024 | 0.072954 | 0.326382 | 0.073899 | 0.333007
1000 | 1024 | 0.0743 | 0.325532 | 0.071126 | 0.330991
6000 | 1024 | 0.127025 | 0.390084 | 0.128692 | 0.471504
6272 | 1024 | 0.130704 | 0.403536 | 0.135244 | 0.487133
200 | 1536 | 0.070331 | 0.339169 | 0.070086 | 0.331015
1000 | 1536 | 0.075085 | 0.330042 | 0.076295 | 0.328778
6000 | 1536 | 0.148889 | 0.44949 | 0.155781 | 0.659987
6272 | 1536 | 0.154939 | 0.478871 | 0.17673 | 0.716025
200 | 2048 | 0.070269 | 0.335585 | 0.072804 | 0.334655
1000 | 2048 | 0.080094 | 0.326991 | 0.080426 | 0.32685
6000 | 2048 | 0.187888 | 0.623023 | 0.245762 | 0.981635
6272 | 2048 | 0.195431 | 0.65244 | 0.262574 | 1.008141
200 | 3072 | 0.068205 | 0.339428 | 0.073068 | 0.344034
1000 | 3072 | 0.087554 | 0.328899 | 0.09218 | 0.346433
6000 | 3072 | 0.240352 | 0.905058 | 0.368135 | 1.280462
6272 | 3072 | 0.26179 | 0.959387 | 0.387782 | 1.476524
128 | 2097152 | 5.905976 | 22.724793 | 10.287974 | 30.242092
256 | 1048576 | 4.561596 | 19.554308 | 10.223171 | 29.42371
512 | 524288 | 4.146751 | 22.7247 | 11.404285 | 39.175902
1024 | 262144 | 5.193135 | 23.403325 | 11.334512 | 38.947192
2048 | 131072 | 4.992907 | 23.377801 | 11.400286 | 40.889191
4096 | 65536 | 5.429488 | 24.275701 | 11.196778 | 41.4751
8192 | 32768 | 5.35758 | 21.360312 | 10.535418 | 42.875646
16384 | 16384 | 5.44947 | 20.852605 | 10.357685 | 34.603408
32768 | 8192 | 4.688925 | 17.379392 | 9.635596 | 31.188271


---------
**At this PR**

M | N | fwd (half) | fwdbwd (half) | fwd (float) | fwdbwd (float)
-- | -- | -- | -- | -- | --
50432 | 384 | 0.38797 | 0.93103 | 0.37966 | 1.15283
50176 | 384 | 0.3874 | 0.96417 | 0.38462 | 1.18595
200704 | 192 | 1.00002 | 2.40876 | 0.99224 | 2.55579
802816 | 64 | 3.67348 | 7.98658 | 3.61871 | 7.72404
200 | 256 | 0.07292 | 0.35119 | 0.07195 | 0.32602
1000 | 256 | 0.07354 | 0.33325 | 0.07237 | 0.33742
6000 | 256 | 0.08819 | 0.33283 | 0.08453 | 0.3279
6272 | 256 | 0.0886 | 0.33446 | 0.08774 | 0.33426
200 | 512 | 0.0701 | 0.33505 | 0.07072 | 0.33018
1000 | 512 | 0.07042 | 0.33442 | 0.074 | 0.33206
6000 | 512 | 0.09931 | 0.34956 | 0.09895 | 0.3572
6272 | 512 | 0.10103 | 0.32976 | 0.10041 | 0.36635
200 | 1024 | 0.07144 | 0.33579 | 0.07209 | 0.33216
1000 | 1024 | 0.0736 | 0.32803 | 0.07286 | 0.32936
6000 | 1024 | 0.12584 | 0.38916 | 0.12852 | 0.48273
6272 | 1024 | 0.13053 | 0.38804 | 0.13464 | 0.49545
200 | 1536 | 0.07159 | 0.3396 | 0.07062 | 0.33545
1000 | 1536 | 0.07443 | 0.33239 | 0.07366 | 0.33204
6000 | 1536 | 0.14959 | 0.45043 | 0.15826 | 0.69119
6272 | 1536 | 0.1542 | 0.47644 | 0.18249 | 0.72208
200 | 2048 | 0.07258 | 0.33982 | 0.07412 | 0.33859
1000 | 2048 | 0.0793 | 0.32816 | 0.07864 | 0.32583
6000 | 2048 | 0.18973 | 0.571 | 0.25506 | 0.91796
6272 | 2048 | 0.19719 | 0.64208 | 0.26445 | 0.95055
200 | 3072 | 0.07092 | 0.33867 | 0.07104 | 0.34695
1000 | 3072 | 0.08727 | 0.33144 | 0.09144 | 0.36633
6000 | 3072 | 0.24683 | 0.87275 | 0.37761 | 1.3289
6272 | 3072 | 0.26437 | 0.91178 | 0.38496 | 1.53694
128 | 2097152 | 6.27936 | 23.69425 | 10.40004 | 30.13699
256 | 1048576 | 4.5404 | 19.47675 | 10.28494 | 29.36936
512 | 524288 | 4.13951 | 18.78771 | 10.09557 | 32.67083
1024 | 262144 | 4.47576 | 18.00411 | 9.56488 | 31.47117
2048 | 131072 | 4.28026 | 16.95619 | 9.40297 | 30.82845
4096 | 65536 | 4.2653 | 16.5018 | 9.03315 | 30.08392
8192 | 32768 | 4.25613 | 16.13583 | 8.9258 | 30.75296
16384 | 16384 | 4.20256 | 16.38207 | 9.52587 | 31.31113
32768 | 8192 | 4.20231 | 16.19452 | 9.31478 | 31.03514


---------

**Performance Improvement (%)**

M | N | fwdbwd,   torch.float16 | fwdbwd,   torch.float32
-- | -- | -- | --
50432 | 384 | 32.178 | 22.049
50176 | 384 | 29.231 | 19.536
200704 | 192 | 44.188 | 43.962
802816 | 64 | 52.119 | 54.100
200 | 256 | -5.750 | -0.206
1000 | 256 | 0.031 | -0.797
6000 | 256 | 3.566 | 5.621
6272 | 256 | 3.865 | 4.836
200 | 512 | -1.615 | -1.010
1000 | 512 | -1.270 | 0.208
6000 | 512 | 3.534 | 5.581
6272 | 512 | 7.905 | 7.483
200 | 1024 | -2.883 | 0.254
1000 | 1024 | -0.767 | 0.493
6000 | 1024 | 0.237 | -2.381
6272 | 1024 | 3.840 | -1.707
200 | 1536 | -0.127 | -1.340
1000 | 1536 | -0.711 | -0.992
6000 | 1536 | -0.209 | -4.728
6272 | 1536 | 0.508 | -0.846
200 | 2048 | -1.262 | -1.176
1000 | 2048 | -0.358 | 0.312
6000 | 2048 | 8.350 | 6.487
6272 | 2048 | 1.588 | 5.713
200 | 3072 | 0.223 | -0.848
1000 | 3072 | -0.773 | -5.743
6000 | 3072 | 3.570 | -3.783
6272 | 3072 | 4.962 | -4.092
128 | 2097152 | -4.266 | 0.348
256 | 1048576 | 0.397 | 0.185
512 | 524288 | 17.325 | 16.605
1024 | 262144 | 23.070 | 19.195
2048 | 131072 | 27.469 | 24.605
4096 | 65536 | 32.023 | 27.465
8192 | 32768 | 24.459 | 28.274
16384 | 16384 | 21.439 | 9.514
32768 | 8192 | 6.818 | 0.491


---------
**Benchmark script of this PR**
```
# Ref:
#       1. pytorch#26201
#       2. pytorch#68238

import torch
from torch.nn import LayerNorm
import timeit

number_runs = 1000  # TODO: Modify this to save time!
def test_forward(layer_norm_cuda, input_cuda):
    layer_norm_cuda(input_cuda); torch.cuda.synchronize()

def test_backward(out_cuda, layer_norm_grad_cuda, create_graph):
    out_cuda.backward(layer_norm_grad_cuda, retain_graph=True, create_graph=create_graph); torch.cuda.synchronize()

def test_fwdbwd(input_cuda, layer_norm_cuda, gO):
    input_cuda.grad = None
    layer_norm_cuda.zero_grad(set_to_none=True)
    out = layer_norm_cuda(input_cuda)
    out.backward(gO)
    torch.cuda.synchronize()

def benchmark(config_m, config_n):

    print("M | N | fwd (half) | fwdbwd (half) | fwd (float) | fwdbwd (float)")
    if len(config_m) != len(config_n):
        print("Please make sure the lengths of config_m and config_m are the same.")

    for i in range(len(config_m)):
        normalized_shape = config_n[i]
        results = [config_m[i], config_n[i]]
        for dtype in (torch.half, torch.float):
            if dtype == torch.half:
                layer_norm_cuda = LayerNorm(normalized_shape).half().cuda()
            else:
                layer_norm_cuda = LayerNorm(normalized_shape).cuda()

            input_cuda = torch.randn(config_m[i], config_n[i], device='cuda', dtype=dtype, requires_grad=True)

            # print("cuda forward:")
            result_fwd = timeit.timeit(lambda: test_forward(layer_norm_cuda, input_cuda), number=number_runs)
            results.append(result_fwd / number_runs * 1000)

            gO = torch.rand_like(input_cuda)

            result_fwdbwd = timeit.timeit(lambda: test_fwdbwd(input_cuda, layer_norm_cuda, gO), number=number_runs)
            results.append(result_fwdbwd / number_runs * 1000)

        print('{:09d}|{:09d}|{:9.5f}|{:9.5f}|{:9.5f}|{:9.5f}'.format(results[0], results[1], results[2], results[3], results[4], results[5]))

    print("Times are in microseconds (us).")

# CVT
config_m_cvt = [50432, 50176, 200704, 802816]
config_n_cvt = [384, 384, 192, 64]

# pytorch#68238 (comment)
config_m_68238 = [200, 1000, 6000, 6272, 200, 1000, 6000, 6272, 200, 1000, 6000, 6272, 200, 1000, 6000, 6272, 200, 1000, 6000, 6272, 200, 1000, 6000, 6272]
config_n_68238 = [256,256,256,256,512,512,512,512,1024,1024,1024,1024,1536,1536,1536,1536,2048,2048,2048,2048,3072,3072,3072,3072]

# pytorch#27634
config_m_27634 = [128, 256, 512, 1024, 2048, 4096, 8192, 16384, 32768]
config_n_27634 = [2097152, 1048576, 524288, 262144, 131072, 65536, 32768, 16384, 8192]

config_m = config_m_cvt + config_m_68238 + config_m_27634
config_n = config_n_cvt + config_n_68238 + config_n_27634

benchmark(config_m, config_n)
```

CC: @jeffdaily

Pull Request resolved: pytorch#87635
Approved by: https://github.com/jataylo, https://github.com/jeffdaily, https://github.com/ezyang
Both quantization and non-quantization scenarios are supported.