I forget the combos that resulted in the autograd leaf variable errors; I'll run into them again soon. The efficientnets seem to break in some other ones.

I was doing some more hacking around with the LN, looking for something that performs reasonably. The codegen on the diff-of-squared-mean norm impl below fails completely (whether scripted in isolation via the decorator like this, or as part of whole-model scripting) with an internal compiler error. This option is optimal on TPU w/ PT XLA and can actually be faster than BN (in train) when substituted into a familiar network. No PyTorch eager or torchscript / aot codegen impl of the NCHW LN can come close to that (usually between 1/3 and 1/2 the throughput and 2x the memory consumption). Again, that's in train / bwd.
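
A minimal sketch of that kind of diff-of-squared-mean NCHW LayerNorm (the function name, eps, and affine handling below are assumptions, not the exact snippet that was failing):

```python
import torch


# Sketch only: normalizes over the channel dim of an NCHW tensor, computing
# variance as E[x^2] - E[x]^2 (diff of squared means) rather than via x.var().
@torch.jit.script
def layer_norm_nchw_diff_sq_mean(
    x: torch.Tensor,
    weight: torch.Tensor,
    bias: torch.Tensor,
    eps: float = 1e-6,
) -> torch.Tensor:
    u = x.mean(dim=1, keepdim=True)
    var = (x * x).mean(dim=1, keepdim=True) - u * u
    x = (x - u) * torch.rsqrt(var + eps)
    return x * weight.view(1, -1, 1, 1) + bias.view(1, -1, 1, 1)
```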
The PyTorch 1.12 release enables NVFuser by default in TorchScript and, combined with the functorch release that accompanies 1.12, also lets users apply aot_autograd compilation to potentially achieve better performance than with TorchScript alone. However, users are running into issues when enabling these options, so this discussion is meant to collect nvfuser-related problems in one place.
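
For reference, a rough sketch of the two setups involved is below; the toy model, shapes, and use of functorch's `memory_efficient_fusion` wrapper are illustrative assumptions, not taken from any particular report.

```python
import torch
from functorch.compile import memory_efficient_fusion

# Toy model and shapes for illustration only.
model = torch.nn.Sequential(
    torch.nn.Linear(512, 512),
    torch.nn.GELU(),
    torch.nn.Linear(512, 512),
).cuda()

x = torch.randn(64, 512, device="cuda", requires_grad=True)

# Path 1: TorchScript. NVFuser is the default fuser in 1.12, so scripting the
# model is enough to route fusible subgraphs to it.
scripted = torch.jit.script(model)
scripted(x).sum().backward()

# Path 2: AOT Autograd via functorch, which traces the forward and backward
# graphs and hands them to NVFuser.
fused = memory_efficient_fusion(model)
fused(x).sum().backward()
```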
For a sample of the problems encountered, see the comments in #1340.
cc @csarofeen, @jjsjann123, @Chillee.