Skip to content
This repository has been archived by the owner on Mar 20, 2024. It is now read-only.

Latest commit

 

History

History
260 lines (197 loc) · 9.61 KB

ediv.adoc

File metadata and controls

260 lines (197 loc) · 9.61 KB

Appendix A: Divided Element Extension (Extension Zvediv)

Warning
The EDIV extension is currently not planned to be part of the base "V" extension, and will change substantially from this current sketch and could be dropped completely. Do not use this specfication for any design work.
Note
This section has not been updated to account for new mask format in v0.9.

The divided element extension allows each element to be treated as a packed sub-vector of narrower elements. This provides efficient support for some forms of narrow-width and mixed-width arithmetic, and also to allow outer-loop vectorization of short vector and matrix operations. In addition to modifying the behavior of some existing instructions, a few new instructions are provided to operate on vectors when EDIV > 1.

The divided element extension adds a two-bit field, vediv[1:0] to the vtype register.

The vediv field encodes the number of ways, EDIV, into which each SEW-bit element is subdivided into equal sub-elements. A vector register group is now considered to hold a vector of sub-vectors.

vediv [1:0] Division EDIV

0

0

1

(undivided, as in base)

0

1

2

two equal sub-elements

1

0

4

four equal sub-elements

1

1

8

eight equal sub-elements

The assembly syntax for vsetvli has additional options added to encode the EDIV options.

 d1   # EDIV 1, assumed if d setting absent
 d2   # EDIV 2
 d4   # EDIV 4
 d8   # EDIV 8

 vsetvli t0, a0, e32, m2, d4   # SEW=32, LMUL=2, EDIV=4

SEW

EDIV

Sub-element

Integer accumulator

FP sum/dot accumulator

sum

dot

FLEN=32

FLEN=64

FLEN=128

8b

2

4b

8b

8b

-

-

-

8b

4

2b

8b

8b

-

-

-

8b

8

1b

8b

8b

-

-

-

16b

2

8b

16b

16b

-

-

-

16b

4

4b

8b

16b

-

-

-

16b

8

2b

8b

8b

-

-

-

32b

2

16b

32b

32b

32b

32b

32b

32b

4

8b

16b

32b

-

-

-

32b

8

4b

8b

16b

-

-

-

64b

2

32b

64b

64b

32b

64b

64b

64b

4

16b

32b

64b

32b

32b

32b

64b

8

8b

16b

32b

-

-

-

128b

2

64b

128b

128b

32b

64b

128b

128b

4

32b

64b

128b

32b

64b

64b

128b

8

16b

32b

64b

32b

32b

32b

256b

2

128b

256b

256b

32b

64b

128b

256b

4

64b

128b

256b

32b

64b

128b

256b

8

32b

64b

128b

32b

64b

64b

Each implementation defines a minimum size for a sub-element, SELEN, which must be at most 8 bits.

Note
While SELEN is a fourth implementation-specific parameter, values smaller than 8 would be considered an additional extension.

Instructions not affected by EDIV

The vector start register vstart and exception reporting continue to work as before.

The vector length vl control and vector masking continue to operate at the element level.

Vector masking continues to operate at the element level, so sub-elements cannot be individually masked.

Note
SEW can be changed dynamically to enabled per-element masking for sub-elements of 8 bits and greater.

Vector load/store and AMO instructions are unaffected by EDIV, and continue to move whole elements.

Vector mask logical operations are unchanged by EDIV setting, and continue to operate on vector registers containing element masks.

Vector mask population count (vpopc), find-first and related instructions (vfirst, vmsbf, vmsif, vmsof), iota (viota), and element index (vid) instructions are unaffected by EDIV.

Vector integer bit insert/extract, and integer and floating-point scalar move instruction are unaffected by EDIV.

Vector slide-up/slide-down are unaffected by EDIV.

Vector compress instructions are unaffected by EDIV.

Instructions Affected by EDIV

Regular Vector Arithmetic Instructions under EDIV

Most vector arithmetic operations are modified to operate on the individual sub-elements, so effective SEW is SEW/EDIV and effective vector length is vl * EDIV. For example, a vector add of 32-bit elements with a vl of 5 and EDIV of 4, operates identically to a vector add of 8-bit elements with a vector length of 20.

vsetvli t0, a0, e32, m1, d4  # Vectors of 32-bit elements, divided into byte sub-elements
vadd.vv v1,v2,v3                     # Performs a vector of 4*vl 8-bit additions.
vsll.vx v1,v2,x1                     # Performs a vector of 4*vl 8-bit shifts.

Vector Add with Carry/Subtract with Borrow Reserved under EDIV>1

For EDIV > 1, vadc, vmadc, vsbc, vmsbc are reserved.

Vector Reduction Instructions under EDIV

Vector single-width integer sum reduction instructions are reserved under EDIV>1. Other vector single-width reductions and vector widening integer sum reduction instructions now operate independently on all elements in a vector, reducing sub-element values within an element to an element-wide result.

The scalar input is taken from the least-significant bits of the second operand, with the number of bits equal to the number of significant result bits (i.e., for sum and dot reductions, the number of bits are given in table above, for non-sum and non-dot reductions, equal to the element size).

# Sum each sub-vector of four bytes into a 16-bit result.
vsetvli t0, a0, e32, d4  # Vectors of 32-bit elements, divided into byte sub-elements
vwredsum.vs v1, v2, v3 # v1[i][15:0] = v2[i][31:24] + v2[i][23:16]
                       #              + v2[i][15:8] + v2[i][7:0] + v3[i][15:0]

# Find maximum among sub-elements
vredmax.vs v5, v6, v7 # v5[i][7:0] = max(v6[i][31:24], v6[i][23:16],
                      #                    v6[i][15:8], v6[i][7:0], v7[i][7:0])

Integer sub-element non-sum reductions produce a final result that is max(8,SEW/EDIV) bits wide, sign- or zero-extended to full SEW if necessary.

Integer sub-element widening sum reductions produce a final result that is max(8,min(SEW,2*SEW/EDIV)) bits wide, sign- or zero-extended to full SEW if necessary.

Single-width floating-point reductions produce a final result that is SEW/EDIV bits wide.

Widening floating-point sum reductions produce a final result that is min(2*SEW/EDIV,FLEN) bits wide, NaN-boxed to the full SEW width if necessary.

Vector Register Gather Instructions under EDIV

Vector register gather instructions under non-zero EDIV only gather sub-elements within the element. The source and index values are interpreted as relative to the enclosing element only. Index values {ge} EDIV write a zero value into the result sub-element.

       |       |       |  SEW = 32b, EDIV=4
        7 6 5 4 3 2 1 0  bytes
        d e a d b e e f  v1
        0 1 9 2 0 2 3 2  v2
                            vrgather.vv v3, v1, v2
        d a 0 e f e b e  v3
                            vrgather.vi v4, v1, 1
        a a a a e e e e  v4
Note
Vector register gathers with scalar or immediate arguments can "splat" values across sub-elements within an element.
Note
Implementations can provide fast implementations of register gathers constrained within a single element width.

Vector Integer Dot-Product Instruction

The integer dot-product reduction vdot.vv performs an element-wise multiplication between the source sub-elements then accumulates the results into the destination vector element. Note the assembler syntax uses a .vv suffix since both inputs are vectors of elements.

Sub-element integer dot reductions produce a final result that is max(8,min(SEW,4*SEW/EDIV)) bits wide, sign- or zero-extended to full SEW if necessary.

# Unsigned dot-product
vdotu.vv vd, vs2, vs1, vm  # Vector-vector

# Signed dot-product
vdot.vv vd, vs2, vs1, vm   # Vector-vector
  # Dot product, SEW=32, EDIV=1
  vdot.vv  vd, vs2, vs1, vm   # vd[i][31:0] += vs2[i][31:0] * vs1[i][31:0]

  # Dot product, SEW=32, EDIV=2
  vdot.vv vd, vs2, vs1, vm # vd[i][31:0] += vs2[i][31:16] * vs1[i][31:16]
                                            + vs2[i][15:0] * vs1[i][15:0]

  # Dot product, SEW=32, EDIV=4
  vdot.vv vd, vs2, vs1, vm # vd[i][31:0] += vs2[i][31:24] * vs1[i][31:24]
                                            + vs2[i][23:16] * vs1[i][23:16]
                                            + vs2[i][15:8] * vs1[i][15:8]
                                            + vs2[i][7:0] * vs1[i][7:0]

Vector Floating-Point Dot Product Instruction

The floating-point dot-product reduction vfdot.vv performs an element-wise multiplication between the source sub-elements then accumulates the results into the destination vector element. Note the assembler syntax uses a .vv suffix since both inputs are vectors of elements.

# Signed dot-product
vfdot.vv vd, vs2, vs1, vm   # Vector-vector
# Dot product. SEW=32, EDIV=2
vfdot.vv  vd, vs2, vs1, vm # vd[i][31:0] += vs2[i][31:16] * vs1[i][31:16]
                                           + vs2[i][15:0] * vs1[i][15:0]

# Floating-point sub-vectors of two half-precision floats packed into 32-bit elements.
vsetvli t0, a0, e32, m1, d2  # Vectors of 32-bit elements, divided into 16b sub-elements
vfdot.vv v1, v2, v3   # v1[i][31:0] +=  v2[i][31:16]*v3[i][31:16] + v2[i][16:0]*v3[i][16:0]

# Floating-point sub-vectors of four half-precision floats packed into 64-bit elements.
vsetvli t0, a0, e64, m1, d4  # Vectors of 64-bit elements, divided into 16b sub-elements
vfdot.vv v1, v2, v3
                 # v1[i][31:0] +=  v2[i][31:16]*v3[i][31:16] + v2[i][16:0]*v3[i][16:0] +
                 #                 v2[i][63:48]*v3[i][63:48] + v2[i][47:32]*v3[i][47:32];
                 # v1[i][63:32] = ~0 (NaN boxing)