Skip to content

Commit

Permalink
[RFC] Specification of quantized reduction operation (openxla#1664)
Browse files Browse the repository at this point in the history
From the discussion emerged in
openxla#1538 about the specification
of quantized `reduce` op, we zeroed in to a proposal which need some
structural changes in the op (adding additional blocks). This proposed
RFC is to further explore the proposal in detail.

Please let me know your feedback. See also the [OpenXLA Discuss post](https://groups.google.com/a/openxla.org/g/openxla-discuss/c/iwE9is49SS4).
  • Loading branch information
sdasgup3 authored Oct 9, 2023
1 parent 6d662d0 commit f084a1a
Showing 1 changed file with 384 additions and 0 deletions.
384 changes: 384 additions & 0 deletions rfcs/20230622-quantized-reduction.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,384 @@
# RFC: StableHLO quantization for reduction ops

Status: Review<br/>
Initial version: 06/22/2023<br/>
updated: 07/13/2023: Minor refactoring of the examples.<br/>
Last updated: 08/11/2023: Revision of the proposal to introduce an
attribute to capture accumulation type.<br/>
Discussion thread: [GitHub](https://github.com/openxla/stablehlo/pull/1664)

## Version log

* 06/22/2023: Initial version.
* 07/13/2023: Fixed typo in code blocks, header indentation.
* 08/11/2023: Revision of the proposal to introduce an attribute to capture
accumulation type.
* 08/25/2023: The additional attribute is redundant.

## Introduction

The [reduce](https://github.com/openxla/stablehlo/blob/main/docs/spec.md#reduce)
op, for non-quantized types, has constraints like

```python
(C2) element_type(inputs...) = element_type(init_values...) = element_type(results...).
(C6) body has type tensor<E0>, ..., tensor<EN-1>, tensor<E0>, ..., tensor<EN-1>) -> (tensor<E0>, ..., tensor<EN-1>) where Ei = element_type(inputs[i]).
```

which constrained the signature of reduce op and its associated reducer function
`body` to have the same element types for `inputs`, `results` and arguments and
return for `body`. For reducer function performing an accumulative operation
like add, this means that the the result of accumulation can overflow in which
case the result will be implementation defined (e.g.,
[saturated](https://en.wikipedia.org/wiki/Saturation_arithmetic) or
[wrap around](https://en.wikipedia.org/wiki/Integer_overflow)). From
the conversation with customers it seems a reasonable behavior for non quantized
data types. However, with quantized data types, such loss in precision is not
acceptable and hence the motivation is to perform the accumulation in some
higher data type.

The RFC introduces the following proposal, emerged out of discussion in the
[thread](https://github.com/openxla/stablehlo/pull/1538#issuecomment-1599476906)
, along with their tradeoffs.

The proposal optionally allows the reducer block to express the computation in a
different element type (preferably wider accumulation type) than the one used in
reduce op's ops arguments and return type. For illustrative purposes, in the
following example, the operand element type `tensor<!quant.uniform<ui8:f32,
input_scale:input_zp>>` is different from the element type for reduction
region's block arguments. Similarly, the element type of the reduce op's
result `!quant.uniform<ui8:f32, output_scale:output_zp>>` is different from
that of block return (`tensor<!quant.uniform<i32:f32, accum_scale:accum_zp>>`).

```mlir
%result = "stablehlo.reduce"(%input, %init_value) ({
^reduce_computation(
%lhs: tensor<!quant.uniform<i32:f32, accum_scale:accum_zp>>,
%rhs: tensor<!quant.uniform<i32:f32, accum_scale:accum_zp>>):
%add = "stablehlo.add"(%lhs, %rhs)
: (tensor<!quant.uniform<i32:f32, accum_scale:accum_zp>>,
tensor<!quant.uniform<i32:f32, accum_scale:accum_zp>>)
-> tensor<!quant.uniform<i32:f32, accum_scale:accum_zp>>
"stablehlo.return"(%add)
: (tensor<!quant.uniform<i32:f32, accum_scale:accum_zp>>) -> ()
}) {
dimensions = dense<1> : tensor<i64>
} : (tensor<5 x 1 x !quant.uniform<ui8:f32, input_scale:input_zp>>,
tensor<!quant.uniform<ui8:f32, input_scale:input_zp>>)
-> tensor<5 x !quant.uniform<ui8:f32, output_scale:output_zp>>
```

### Semantics

If (1) the input operand type is different from the reduction block argument
type or (2) the op result type is different from the reduction block return
type, there will be an implicit type conversion defined by either
`stablehlo.convert`, `stablehlo.uniform_quantize`, or
`stablehlo.uniform_dequantize`. When the types are not differnet, i.e., when (1)
and (2) does not hold true, then no implicit convertion is needed.

For example,

| Implicit type conversion op | element type of operand or block return | element type of block argument or op return |
|-----------------------------------|-----------------------------------------|---------------------------------------------|
| (A) `stablehlo.uniform_quantize` | quantized tensor | quantized tensor |
| (B) `stablehlo.uniform_quantize` | floating point | quantized tensor |
| (C) `stablehlo.uniorm_dequantize` | quantized tensor | floating point |
| (D) `stablehlo.convert` | floating-point | integer |
| (E) `stablehlo.convert` | integer | floating-point |
| (F) `stablehlo.convert` | floating-point | floating-point |
| (G) `stablehlo.convert` | integer | integer |
| (H) `stablehlo.convert` | complex | complex |

At this point there is no use for cases other than (A), (F), and (G). My
proposal here would be to address (A), (F), and (G) only. Note that the (F)
partially addresses
[Decide on mixed precision](https://github.com/openxla/stablehlo/issues/369)
for reduce op in that it allows the input or init value to differ from the
corresponding block arguments w.r.t the precision of floating-point types.
However, the mixed precision implementation in HLO seems more detailed in the
following sense:

* [Decide on mixed precision](https://github.com/openxla/stablehlo/issues/369)
allows `inputs` and `init_values` to differ in floating-point precision.
Whereas, the current proposal considers them to have the same element type.
* [Decide on mixed precision](https://github.com/openxla/stablehlo/issues/369)
allows the element type of block arguments to differ from that of the block
return value. The current proposal considers them to have the same element type.
* There are other ops (than reduce) which need support for mixed precision (here
is the [list of ops](https://github.com/tensorflow/tensorflow/blob/1d69ba72834b963b72075a82c10959f6bb74e473/tensorflow/compiler/xla/service/hlo_verifier.cc#L1681-L1714)).

Having said that, my proposal would be to treat the above ticket separately.

## Appendix

To provide an estimate of specification changes needed to implement the
proposal, I have attempted to provide the blueprint here.

### Revised specification of reduce op

Here we include only the relevant portions of the spec with the proposed update.

#### Semantics

...

More formally, `results...[j0, ..., jR-1] =
reduce_implicit_convert(reduce(input_slices_converted),
type(func_outputs(body)...), type(results...)))` where:

* `input_slices = inputs...[j0, ..., :, ..., jR-1]`, where `:` are inserted
at `dimensions`.
* `input_slices_converted = reduce_implicit_convert(input_slices...,
type(inputs...), type(func_inputs(body)...)`.
* `reduce(input_slices_converted) = exec(schedule)` for some binary tree
`schedule` where:
* `exec(node) = body(exec(node.left), exec(node.right))`.
* `exec(leaf) = leaf.value`.
* `schedule` is an implementation-defined full binary tree whose in-order
traversal consists of:
* `input_slices_converted...[index]` values, for all `index` in
`index_space(input_slices_converted)` in the ascending lexicographic order
of `index`.
* Interspersed with an implementation-defined amount of
`reduce_implicit_convert(init_values..., type(init_values...), type(func_inputs(body)[:len(func_inputs(body)//2)])...)`
at implementation-defined positions.

#### Constraints

* (C?) `same(shape(inputs...))`.
* (C?) `element_type(inputs...) = element_type(init_values...)`.
* (C?) `baseline_element_type(inputs...) = baseline_element_type(results...)`.
* (C?) `body` has type `tensor<E0>, ..., tensor<EN-1>, tensor<E0>, ...,`
`tensor<EN-1>) -> (tensor<E0>, ..., tensor<EN-1>)` where
`is_integer(element_type(inputs[i])) = is_integer(element_type(E[i]))` or
`is_float(element_type(inputs[i])) = is_float(element_type(E[i]))` or
`is_complex(element_type(inputs[i])) = is_complex(element_type(E[i]))` or
`is_quantized(element_type(inputs[i])) = is_quantized(element_type(E[i]))`.
* (C?) `shape(results...) = shape(inputs...)` except that the dimension
sizes of `inputs...` corresponding to `dimensions` are not included.

`reduce_implicit_convert` is defined as

```python
def reduce_implicit_convert(x: Value, source_type: Type, destination_type:
Type):
if source_type == destination_type:
return x
if is_quantized(source_type) and is_quantized(destination_type):
return quantize(x, destination_type)
return convert(x, destination_type)
```

The above specification of `reduce` op can be used to define the specification
of other ops as shown below. As before, we are only presenting the relevant
portions of the spec which needs modification.

### Revised specification of reduce_window op

#### Constraints

* (C?) `element_type(inputs...) = element_type(init_values...)`.
* (C?) `baseline_element_type(inputs...) = baseline_element_type(results...)`.
* (C?) `body` has type `(tensor<E0>, ..., tensor<EN-1>, tensor<E0>, ...,`
`tensor<EN-1>) -> (tensor<E0>, ..., tensor<EN-1>)` where
`is_integer(element_type(inputs[i])) = is_integer(element_type(E[i]))` or
`is_float(element_type(inputs[i])) = is_float(element_type(E[i]))` or
`is_complex(element_type(inputs[i])) = is_complex(element_type(E[i]))` or
`is_quantized(element_type(inputs[i])) = is_quantized(element_type(E[i]))`.

### Revised specification of select_and_scatter op

This op originally takes two function arguments `select` and `scatter`. As the
`select` function is supposed to perform a non-accumulative operation, we may
not need additional conversion functions associated with `select`. But the
`scatter` function needs be accompanied with `input_conversion` and
`output_conversion` functions.

#### Constraints

<!-- markdownlint-disable line-length -->
* (C1) `element_type(operand) = element_type(source)`.
* (C3) `element_type(init_value) = element_type(operand)`.
* (C?) `baseline_element_type(inputs...) = baseline_element_type(results...)`.
* (C10) `scatter` has type `(tensor<E>, tensor<E>) -> tensor<E>` where
`is_integer(element_type(operand)) = is_integer(element_type(E))` or
`is_float(element_type(operand)) = is_float(element_type(E))` or
`is_complex(element_type(operand)) = is_complex(element_type(E))` or
`is_quantized(element_type(operand)) = is_quantized(element_type(E))`.
<!-- markdownlint-enable line-length -->

### Action Plan

I propose to follow the action plan (order matters):

* Update the specification of ReduceOp, ReduceWindowOp, and SelectAndScatterOp
op, taking the accumulation type into account, via [open
pr](https://github.com/openxla/stablehlo/pull/1538).
* Finalize the quantized specification of AllReduceOp, BatchNormTrainingOp,
BatchNormGradOp and ReduceScatterOp, whose semantics depend on ReduceOp,
via [open ticket](https://github.com/openxla/stablehlo/issues/1666).
* Spec the behavior of `precision_config` in DotGeneralOp. [open
issue](https://github.com/openxla/stablehlo/issues/755)
* Consider adding `precision_config` in reduction op. `precision_config`,
currently used for `dot_general` and `convolution`, to override the precision
specified by the input parameters, allowing the choice of low precision vs
high precision computation. We should consider adding `precision_config` to
all reduction based op as well. [need a ticket for this]
* Consider adding `accumulation_type` to `dot_general`/`convolution op`. The
attribute seems beneficial for ops like `dot_general` and `convolution` which
does not have an explicit reduction function. [need a ticket for this item].

## Summary of previous proposals

For completeness of the presentation, let me provide the proposals which are
evaluated previously and help shape the current proposal.

### Re-scale input to accumulation type

This option is the simplest from the POV for specification of quantized `reduce`
op. This is adding `stablehlo.uniform_quantize`ops before and after reduce op
which operates on the "accumulator" type.

```mlir
%widen = "stablehlo.uniform_quantize"(%input)
: (tensor<... x !quant.uniform<ui8:f32, ...>>) -> tensor<... x !quant.uniform<i32:f32, ...>>
%reduce = "stablehlo.reduce"(%widen) {
^reduce_computation(%lhs: !quant.uniform<i32:f32, ...>, %rhs: !qunat.uniform<i32:f32, ...>):
// reduce_computation_block
}
: (tensor<... x !quant.uniform<i32:f32, ...>>) -> tensor<... x !quant.uniform<i32:f32, ...>>
%narrowed = "stablehlo.uniform_quantize"(%reduce)
: (tensor<... x !quant.uniform<i32:f32, ...>>) -> tensor<... x !quant.uniform<ui8:f32, ...>>
```

#### Tradeoffs

* (+) An advantage of this option is that we only need minor changes to the
specification (i.e. to allow quantized types).
* (-) The compiler must pattern match 3 operations and map them into some
internal representation before their compilation or execution.
* (-) The compiler must ensure that the `stablehlo.uniform_quantize` (or
`stablehlo.convert` in the case of `bf16` or `f16`) is not folded before the
backend matches the pattern.
[for more information](https://github.com/openxla/stablehlo/pull/1538#issuecomment-1599476906)

This proposal should be avoided because it is hard to control the transformation
which might disrupt the pattern to be matched.

### Introduce on-the-fly type conversions

Proposes addition two regions in reduce op to (1) convert the input type to the
type of the `body` function argument and (2) convert the result type of the
`body` function to the output type. Following is the code snippet with the
proposed syntax of reduce op:

```mlir
%result = "stablehlo.reduce"(%input, %init_value) ({
^input_conversion(
%input: tensor<!quant.uniform<ui8:f32, input_scale:input_zp>>):
%input_rescaled = "stablehlo.uniform_quantize"(%input)
: (tensor<!quant.uniform<ui8:f32, input_scale:input_zp>>)
-> tensor<!quant.uniform<i32:f32, accum_scale:accum_zp>>
"stablehlo.return"(%input_rescaled)
: (tensor<!quant.uniform<i32:f32, accum_scale:accum_zp>>) -> ()
}, {
^reduce_computation(
%lhs: tensor<!quant.uniform<i32:f32, accum_scale:accum_zp>>,
%rhs: tensor<!quant.uniform<i32:f32, accum_scale:accum_zp>>):
%add = "stablehlo.add"(%lhs, %rhs)
: (tensor<!quant.uniform<i32:f32, accum_scale:accum_zp>>,
tensor<!quant.uniform<i32:f32, accum_scale:accum_zp>>)
-> tensor<!quant.uniform<i32:f32, accum_scale:accum_zp>>
"stablehlo.return"(%add)
: (tensor<!quant.uniform<i32:f32, accum_scale:accum_zp>>) -> ()
}, {
^output_conversion(
%intermediate_result: tensor<!quant.uniform<i32:f32, accum_scale:accum_zp>>):
%output_rescaled = "stablehlo.uniform_quantize"(%intermediate_result)
: (tensor<!quant.uniform<i32:f32, accum_scale:accum_zp>>)
-> tensor<!quant.uniform<ui8:f32, output_scale:output_zp>>
"stablehlo.return"(%output_rescaled)
: (tensor<!quant.uniform<ui8:f32, output_scale:output_zp>>) -> ()
}) {
dimensions = dense<...> : tensor<1xi64>
} : (tensor<... x !quant.uniform<ui8:f32, input_scale:input_zp>>,
tensor<... x !quant.uniform<ui8:f32, input_scale:input_zp>>)
-> tensor<... x !quant.uniform<ui8:f32, output_scale:output_zp>>
```

Here we will informally propose the semantics of the additional functions
`input_conversion` and `output_conversion` introduced.

```python
+----------+ +--------+ +--------+ +----------+ +--------+ +--------+
|init_value| |input[0]| |input[1]| |init_value| |input[2]| |input[3]|
+----------+ +--------+ +--------+ +----------+ +--------+ +--------+
| | | | | |
+----------+ +--------+ +--------+ +----------+ +--------+ +--------+
|input | |input | |input | |input | |input | |input |
|convert | |convert | |convert | |convert | |convert | |convert |
+----------+ +--------+ +--------+ +----------+ +--------+ +--------+
\ / / \ / /
+-------+ / +-------+ /
|compute| / |compute| /
+-------+ / +-------+ /
\ / \ /
+-------+ +-------+
|compute| |compute|
+-------+ +-------+
\___________ ___________/
\ /
+-------+
|compute|
+-------+
|
+-------+
|output |
|convert|
+-------+
```

### Tradeoffs

* (+) Enables programmers to program at (almost) baremetal. If the hardware
can support reduction computation in wider type (e.g. in the SIMD
instruction set, we typically do widening/compute/narrowing within the
kernel to save the memory bandwidth), the programmer can explicitly request
for that.
* (-) The disadvantage of this representation is that the syntax is more
verbose and requires significant changes to the specification.
* (-) The extra input/output conversion blocks are surplus information. The
intent of conversion blocks is to capture the accumulation type needed to
compute the accumulative operation on. The specification would benefit if the
intent can be expressed succinctly.

### Introduce accumulation type attribute

Instead of using additional input and output conversion blocks, use a type
attribute `accumulation type` to capture the accumulation type. As an example,

```mlir
%0 = stablehlo.reduce(%arg0 init: %arg1) across dimensions = [0] {
accumulation_type = tensor<!quant.uniform<i32:f32, 3.400000e+01:16>>
} : (tensor<16x!quant.uniform<i8:f32, 3.400000e+01:16>>, tensor<!quant.uniform<i8:f32, 3.400000e+01:16>>) -> tensor<!quant.uniform<i8:f32, 3.400000e+01:16>>
reducer(%arg2: tensor<!quant.uniform<i32:f32, 3.400000e+01:16>>, %arg3: tensor<!quant.uniform<i32:f32, 3.400000e+01:16>>) {
%1 = stablehlo.add %arg2, %arg3 : tensor<!quant.uniform<i32:f32, 3.400000e+01:16>>
stablehlo.return %1 : tensor<!quant.uniform<i32:f32, 3.400000e+01:16>>
}
// using tablegen specification like
// OptionalAttr<TypeAttrOf<HLO_Tensor>>:$accumulation_type
```

Note that the main difference between this option and the previous option is
that the input and output conversion blocks are no longer used and their intent
is specified via the `accumulation_type` attribute. However, the reducer block
needs to express the computation in accumulation type only.

This options is discarded because, for reduce op, the additional attribute seems
redundant and can be inferred based on the differences in element type of
operand and reduction block arguments (as described in the current proposal).

0 comments on commit f084a1a

Please sign in to comment.