- Introduction
- Type System
- Configuration-Setting and vl Argument
- Naming Rules
- Exceptions in Naming
- Scalar in Vector Operations
- Mask in Intrinsics
- Masked Intrinsics Without MaskedOff
- Keep the Original Values of the Destination Vector
- SEW and LMUL of Intrinsics
- C Operators on RISC-V Vector Types
- Utility Functions
- Vector Initialization
- Reinterpret between floating point and integer types
- Reinterpret between signed and unsigned types
- Reinterpret between different SEWs under the same LMUL
- LMUL truncation and LMUL extension functions
- Vector Insertion and Extraction functions
- Utility Functions for Segment Load/Store Types
- Overloaded Interface
- Switching Vtype and Keeping the Same VL in a Loop
- Programming Note
This document introduces the intrinsics for RISC-V vector programming, including the design decisions we made, the type system, the general naming rules for intrinsics, and facilities for vector users. It does not list all available intrinsics for vector programming; the full set of intrinsics will be documented separately.
Encode SEW and LMUL into data types. We enforce the constraint LMUL ≥ SEW/ELEN in the implementation. There are the following data types for ELEN = 64.
Types | LMUL = 1 | LMUL = 2 | LMUL = 4 | LMUL = 8 | LMUL = 1/2 | LMUL = 1/4 | LMUL = 1/8 |
---|---|---|---|---|---|---|---|
int64_t | vint64m1_t | vint64m2_t | vint64m4_t | vint64m8_t | N/A | N/A | N/A |
uint64_t | vuint64m1_t | vuint64m2_t | vuint64m4_t | vuint64m8_t | N/A | N/A | N/A |
int32_t | vint32m1_t | vint32m2_t | vint32m4_t | vint32m8_t | vint32mf2_t | N/A | N/A |
uint32_t | vuint32m1_t | vuint32m2_t | vuint32m4_t | vuint32m8_t | vuint32mf2_t | N/A | N/A |
int16_t | vint16m1_t | vint16m2_t | vint16m4_t | vint16m8_t | vint16mf2_t | vint16mf4_t | N/A |
uint16_t | vuint16m1_t | vuint16m2_t | vuint16m4_t | vuint16m8_t | vuint16mf2_t | vuint16mf4_t | N/A |
int8_t | vint8m1_t | vint8m2_t | vint8m4_t | vint8m8_t | vint8mf2_t | vint8mf4_t | vint8mf8_t |
uint8_t | vuint8m1_t | vuint8m2_t | vuint8m4_t | vuint8m8_t | vuint8mf2_t | vuint8mf4_t | vuint8mf8_t |
float64 | vfloat64m1_t | vfloat64m2_t | vfloat64m4_t | vfloat64m8_t | N/A | N/A | N/A |
float32 | vfloat32m1_t | vfloat32m2_t | vfloat32m4_t | vfloat32m8_t | vfloat32mf2_t | N/A | N/A |
float16 | vfloat16m1_t | vfloat16m2_t | vfloat16m4_t | vfloat16m8_t | vfloat16mf2_t | vfloat16mf4_t | N/A |
There are the following data types for ELEN = 32.
Types | LMUL = 1 | LMUL = 2 | LMUL = 4 | LMUL = 8 | LMUL = 1/2 | LMUL = 1/4 | LMUL = 1/8 |
---|---|---|---|---|---|---|---|
int32_t | vint32m1_t | vint32m2_t | vint32m4_t | vint32m8_t | vint32mf2_t | N/A | N/A |
uint32_t | vuint32m1_t | vuint32m2_t | vuint32m4_t | vuint32m8_t | vuint32mf2_t | N/A | N/A |
int16_t | vint16m1_t | vint16m2_t | vint16m4_t | vint16m8_t | vint16mf2_t | vint16mf4_t | N/A |
uint16_t | vuint16m1_t | vuint16m2_t | vuint16m4_t | vuint16m8_t | vuint16mf2_t | vuint16mf4_t | N/A |
int8_t | vint8m1_t | vint8m2_t | vint8m4_t | vint8m8_t | vint8mf2_t | vint8mf4_t | vint8mf8_t |
uint8_t | vuint8m1_t | vuint8m2_t | vuint8m4_t | vuint8m8_t | vuint8mf2_t | vuint8mf4_t | vuint8mf8_t |
float32 | vfloat32m1_t | vfloat32m2_t | vfloat32m4_t | vfloat32m8_t | vfloat32mf2_t | N/A | N/A |
float16 | vfloat16m1_t | vfloat16m2_t | vfloat16m4_t | vfloat16m8_t | vfloat16mf2_t | vfloat16mf4_t | N/A |
Encode the ratio of SEW/LMUL into the mask types. There are the following mask types, where n = SEW/LMUL.
Types | n = 1 | n = 2 | n = 4 | n = 8 | n = 16 | n = 32 | n = 64 |
---|---|---|---|---|---|---|---|
bool | vbool1_t | vbool2_t | vbool4_t | vbool8_t | vbool16_t | vbool32_t | vbool64_t |
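For example, several (SEW, LMUL) pairs share the same ratio n and therefore the same mask type. A minimal sketch (the comparison intrinsic names follow the rules described later in this document; vl is an active vector length obtained from a vsetvl intrinsic):
// SEW=8, LMUL=1 and SEW=16, LMUL=2 both give n = 8, so both comparisons produce vbool8_t.
vint8m1_t a8, b8;
vint16m2_t a16, b16;
size_t vl; // active vector length from vsetvl
vbool8_t m0 = vmseq_vv_i8m1_b8(a8, b8, vl);    // mask from 8-bit comparison
vbool8_t m1 = vmseq_vv_i16m2_b8(a16, b16, vl); // mask from 16-bit comparison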
Segment (tuple) types are defined under the constraint LMUL * NR ≤ 8.
SEGMENT_TYPE ::= 'v' TYPE LMUL 'x' NR '_t'
TYPE ::= ( 'int8' | 'int16' | 'int32' | 'int64' |
'uint8' | 'uint16' | 'uint32' | 'uint64' |
'float16' | 'float32' | 'float64' )
LMUL ::= ( m1 | m2 | m4 | m8 | mf2 | mf4 | mf8 )
NR ::= ( 2 | 3 | 4 | 5 | 6 | 7 | 8 )
int8_t:
NF \ LMUL | LMUL = 1/8 | LMUL = 1/4 | LMUL = 1/2 | LMUL = 1 | LMUL = 2 | LMUL = 4 | LMUL = 8 |
---|---|---|---|---|---|---|---|
2 | vint8mf8x2_t | vint8mf4x2_t | vint8mf2x2_t | vint8m1x2_t | vint8m2x2_t | vint8m4x2_t | N/A |
3 | vint8mf8x3_t | vint8mf4x3_t | vint8mf2x3_t | vint8m1x3_t | vint8m2x3_t | N/A | N/A |
4 | vint8mf8x4_t | vint8mf4x4_t | vint8mf2x4_t | vint8m1x4_t | vint8m2x4_t | N/A | N/A |
5 | vint8mf8x5_t | vint8mf4x5_t | vint8mf2x5_t | vint8m1x5_t | N/A | N/A | N/A |
6 | vint8mf8x6_t | vint8mf4x6_t | vint8mf2x6_t | vint8m1x6_t | N/A | N/A | N/A |
7 | vint8mf8x7_t | vint8mf4x7_t | vint8mf2x7_t | vint8m1x7_t | N/A | N/A | N/A |
8 | vint8mf8x8_t | vint8mf4x8_t | vint8mf2x8_t | vint8m1x8_t | N/A | N/A | N/A |
uint8_t:
NF \ LMUL | LMUL = 1/8 | LMUL = 1/4 | LMUL = 1/2 | LMUL = 1 | LMUL = 2 | LMUL = 4 | LMUL = 8 |
---|---|---|---|---|---|---|---|
2 | vuint8mf8x2_t | vuint8mf4x2_t | vuint8mf2x2_t | vuint8m1x2_t | vuint8m2x2_t | vuint8m4x2_t | N/A |
3 | vuint8mf8x3_t | vuint8mf4x3_t | vuint8mf2x3_t | vuint8m1x3_t | vuint8m2x3_t | N/A | N/A |
4 | vuint8mf8x4_t | vuint8mf4x4_t | vuint8mf2x4_t | vuint8m1x4_t | vuint8m2x4_t | N/A | N/A |
5 | vuint8mf8x5_t | vuint8mf4x5_t | vuint8mf2x5_t | vuint8m1x5_t | N/A | N/A | N/A |
6 | vuint8mf8x6_t | vuint8mf4x6_t | vuint8mf2x6_t | vuint8m1x6_t | N/A | N/A | N/A |
7 | vuint8mf8x7_t | vuint8mf4x7_t | vuint8mf2x7_t | vuint8m1x7_t | N/A | N/A | N/A |
8 | vuint8mf8x8_t | vuint8mf4x8_t | vuint8mf2x8_t | vuint8m1x8_t | N/A | N/A | N/A |
int16_t:
NF \ LMUL | LMUL = 1/8 | LMUL = 1/4 | LMUL = 1/2 | LMUL = 1 | LMUL = 2 | LMUL = 4 | LMUL = 8 |
---|---|---|---|---|---|---|---|
2 | N/A | vint16mf4x2_t | vint16mf2x2_t | vint16m1x2_t | vint16m2x2_t | vint16m4x2_t | N/A |
3 | N/A | vint16mf4x3_t | vint16mf2x3_t | vint16m1x3_t | vint16m2x3_t | N/A | N/A |
4 | N/A | vint16mf4x4_t | vint16mf2x4_t | vint16m1x4_t | vint16m2x4_t | N/A | N/A |
5 | N/A | vint16mf4x5_t | vint16mf2x5_t | vint16m1x5_t | N/A | N/A | N/A |
6 | N/A | vint16mf4x6_t | vint16mf2x6_t | vint16m1x6_t | N/A | N/A | N/A |
7 | N/A | vint16mf4x7_t | vint16mf2x7_t | vint16m1x7_t | N/A | N/A | N/A |
8 | N/A | vint16mf4x8_t | vint16mf2x8_t | vint16m1x8_t | N/A | N/A | N/A |
uint16_t:
NF \ LMUL | LMUL = 1/8 | LMUL = 1/4 | LMUL = 1/2 | LMUL = 1 | LMUL = 2 | LMUL = 4 | LMUL = 8 |
---|---|---|---|---|---|---|---|
2 | N/A | vuint16mf4x2_t | vuint16mf2x2_t | vuint16m1x2_t | vuint16m2x2_t | vuint16m4x2_t | N/A |
3 | N/A | vuint16mf4x3_t | vuint16mf2x3_t | vuint16m1x3_t | vuint16m2x3_t | N/A | N/A |
4 | N/A | vuint16mf4x4_t | vuint16mf2x4_t | vuint16m1x4_t | vuint16m2x4_t | N/A | N/A |
5 | N/A | vuint16mf4x5_t | vuint16mf2x5_t | vuint16m1x5_t | N/A | N/A | N/A |
6 | N/A | vuint16mf4x6_t | vuint16mf2x6_t | vuint16m1x6_t | N/A | N/A | N/A |
7 | N/A | vuint16mf4x7_t | vuint16mf2x7_t | vuint16m1x7_t | N/A | N/A | N/A |
8 | N/A | vuint16mf4x8_t | vuint16mf2x8_t | vuint16m1x8_t | N/A | N/A | N/A |
int32_t:
NF \ LMUL | LMUL = 1/8 | LMUL = 1/4 | LMUL = 1/2 | LMUL = 1 | LMUL = 2 | LMUL = 4 | LMUL = 8 |
---|---|---|---|---|---|---|---|
2 | N/A | N/A | vint32mf2x2_t | vint32m1x2_t | vint32m2x2_t | vint32m4x2_t | N/A |
3 | N/A | N/A | vint32mf2x3_t | vint32m1x3_t | vint32m2x3_t | N/A | N/A |
4 | N/A | N/A | vint32mf2x4_t | vint32m1x4_t | vint32m2x4_t | N/A | N/A |
5 | N/A | N/A | vint32mf2x5_t | vint32m1x5_t | N/A | N/A | N/A |
6 | N/A | N/A | vint32mf2x6_t | vint32m1x6_t | N/A | N/A | N/A |
7 | N/A | N/A | vint32mf2x7_t | vint32m1x7_t | N/A | N/A | N/A |
8 | N/A | N/A | vint32mf2x8_t | vint32m1x8_t | N/A | N/A | N/A |
uint32_t:
NF \ LMUL | LMUL = 1/8 | LMUL = 1/4 | LMUL = 1/2 | LMUL = 1 | LMUL = 2 | LMUL = 4 | LMUL = 8 |
---|---|---|---|---|---|---|---|
2 | N/A | N/A | vuint32mf2x2_t | vuint32m1x2_t | vuint32m2x2_t | vuint32m4x2_t | N/A |
3 | N/A | N/A | vuint32mf2x3_t | vuint32m1x3_t | vuint32m2x3_t | N/A | N/A |
4 | N/A | N/A | vuint32mf2x4_t | vuint32m1x4_t | vuint32m2x4_t | N/A | N/A |
5 | N/A | N/A | vuint32mf2x5_t | vuint32m1x5_t | N/A | N/A | N/A |
6 | N/A | N/A | vuint32mf2x6_t | vuint32m1x6_t | N/A | N/A | N/A |
7 | N/A | N/A | vuint32mf2x7_t | vuint32m1x7_t | N/A | N/A | N/A |
8 | N/A | N/A | vuint32mf2x8_t | vuint32m1x8_t | N/A | N/A | N/A |
int64_t:
NF \ LMUL | LMUL = 1/8 | LMUL = 1/4 | LMUL = 1/2 | LMUL = 1 | LMUL = 2 | LMUL = 4 | LMUL = 8 |
---|---|---|---|---|---|---|---|
2 | N/A | N/A | N/A | vint64m1x2_t | vint64m2x2_t | vint64m4x2_t | N/A |
3 | N/A | N/A | N/A | vint64m1x3_t | vint64m2x3_t | N/A | N/A |
4 | N/A | N/A | N/A | vint64m1x4_t | vint64m2x4_t | N/A | N/A |
5 | N/A | N/A | N/A | vint64m1x5_t | N/A | N/A | N/A |
6 | N/A | N/A | N/A | vint64m1x6_t | N/A | N/A | N/A |
7 | N/A | N/A | N/A | vint64m1x7_t | N/A | N/A | N/A |
8 | N/A | N/A | N/A | vint64m1x8_t | N/A | N/A | N/A |
uint64_t:
NF \ LMUL | LMUL = 1/8 | LMUL = 1/4 | LMUL = 1/2 | LMUL = 1 | LMUL = 2 | LMUL = 4 | LMUL = 8 |
---|---|---|---|---|---|---|---|
2 | N/A | N/A | N/A | vuint64m1x2_t | vuint64m2x2_t | vuint64m4x2_t | N/A |
3 | N/A | N/A | N/A | vuint64m1x3_t | vuint64m2x3_t | N/A | N/A |
4 | N/A | N/A | N/A | vuint64m1x4_t | vuint64m2x4_t | N/A | N/A |
5 | N/A | N/A | N/A | vuint64m1x5_t | N/A | N/A | N/A |
6 | N/A | N/A | N/A | vuint64m1x6_t | N/A | N/A | N/A |
7 | N/A | N/A | N/A | vuint64m1x7_t | N/A | N/A | N/A |
8 | N/A | N/A | N/A | vuint64m1x8_t | N/A | N/A | N/A |
float16:
NF \ LMUL | LMUL = 1/8 | LMUL = 1/4 | LMUL = 1/2 | LMUL = 1 | LMUL = 2 | LMUL = 4 | LMUL = 8 |
---|---|---|---|---|---|---|---|
2 | N/A | vfloat16mf4x2_t | vfloat16mf2x2_t | vfloat16m1x2_t | vfloat16m2x2_t | vfloat16m4x2_t | N/A |
3 | N/A | vfloat16mf4x3_t | vfloat16mf2x3_t | vfloat16m1x3_t | vfloat16m2x3_t | N/A | N/A |
4 | N/A | vfloat16mf4x4_t | vfloat16mf2x4_t | vfloat16m1x4_t | vfloat16m2x4_t | N/A | N/A |
5 | N/A | vfloat16mf4x5_t | vfloat16mf2x5_t | vfloat16m1x5_t | N/A | N/A | N/A |
6 | N/A | vfloat16mf4x6_t | vfloat16mf2x6_t | vfloat16m1x6_t | N/A | N/A | N/A |
7 | N/A | vfloat16mf4x7_t | vfloat16mf2x7_t | vfloat16m1x7_t | N/A | N/A | N/A |
8 | N/A | vfloat16mf4x8_t | vfloat16mf2x8_t | vfloat16m1x8_t | N/A | N/A | N/A |
float32:
NF \ LMUL | LMUL = 1/8 | LMUL = 1/4 | LMUL = 1/2 | LMUL = 1 | LMUL = 2 | LMUL = 4 | LMUL = 8 |
---|---|---|---|---|---|---|---|
2 | N/A | N/A | vfloat32mf2x2_t | vfloat32m1x2_t | vfloat32m2x2_t | vfloat32m4x2_t | N/A |
3 | N/A | N/A | vfloat32mf2x3_t | vfloat32m1x3_t | vfloat32m2x3_t | N/A | N/A |
4 | N/A | N/A | vfloat32mf2x4_t | vfloat32m1x4_t | vfloat32m2x4_t | N/A | N/A |
5 | N/A | N/A | vfloat32mf2x5_t | vfloat32m1x5_t | N/A | N/A | N/A |
6 | N/A | N/A | vfloat32mf2x6_t | vfloat32m1x6_t | N/A | N/A | N/A |
7 | N/A | N/A | vfloat32mf2x7_t | vfloat32m1x7_t | N/A | N/A | N/A |
8 | N/A | N/A | vfloat32mf2x8_t | vfloat32m1x8_t | N/A | N/A | N/A |
float64:
NF \ LMUL | LMUL = 1/8 | LMUL = 1/4 | LMUL = 1/2 | LMUL = 1 | LMUL = 2 | LMUL = 4 | LMUL = 8 |
---|---|---|---|---|---|---|---|
2 | N/A | N/A | N/A | vfloat64m1x2_t | vfloat64m2x2_t | vfloat64m4x2_t | N/A |
3 | N/A | N/A | N/A | vfloat64m1x3_t | vfloat64m2x3_t | N/A | N/A |
4 | N/A | N/A | N/A | vfloat64m1x4_t | vfloat64m2x4_t | N/A | N/A |
5 | N/A | N/A | N/A | vfloat64m1x5_t | N/A | N/A | N/A |
6 | N/A | N/A | N/A | vfloat64m1x6_t | N/A | N/A | N/A |
7 | N/A | N/A | N/A | vfloat64m1x7_t | N/A | N/A | N/A |
8 | N/A | N/A | N/A | vfloat64m1x8_t | N/A | N/A | N/A |
There are two variants of configuration-setting intrinsics: vsetvl and vsetvlmax. vsetvl is used to get the active vector length (vl) according to the given application vector length (AVL), SEW, and LMUL; vsetvlmax returns VLMAX for the given SEW and LMUL.

The vl register status is not exposed at the C language level, so in theory you can treat the vsetvl functions as returning the minimum of avl and VLMAX, and the vsetvlmax functions as returning VLMAX.
size_t vsetvl_e8m1 (size_t avl);
size_t vsetvl_e8m2 (size_t avl);
size_t vsetvl_e8m4 (size_t avl);
size_t vsetvl_e8m8 (size_t avl);
size_t vsetvlmax_e8m1 ();
size_t vsetvlmax_e8m2 ();
size_t vsetvlmax_e8m4 ();
size_t vsetvlmax_e8m8 ();
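For example, a typical strip-mining loop recomputes vl on every iteration from the remaining application vector length. The following is a minimal sketch; it assumes the unit-stride load/store intrinsics vle8_v_i8m1/vse8_v_i8m1, which follow the naming rules described later in this document:
// c[i] = a[i] + b[i] for 0 <= i < n, processed at most VLMAX elements at a time.
void vec_add(int8_t *c, const int8_t *a, const int8_t *b, size_t n) {
  for (size_t i = 0; i < n;) {
    size_t vl = vsetvl_e8m1(n - i);          // vl = min(n - i, VLMAX)
    vint8m1_t va = vle8_v_i8m1(a + i, vl);   // load vl elements
    vint8m1_t vb = vle8_v_i8m1(b + i, vl);
    vint8m1_t vc = vadd_vv_i8m1(va, vb, vl); // element-wise add
    vse8_v_i8m1(c + i, vc, vl);              // store vl elements
    i += vl;
  }
}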
There is no need to specify the behavior of tail and masked-off elements being undisturbed or agnostic. The default setting is tail agnostic and masked-off undisturbed. If users do not want to keep the values in masked-off elements, they could pass vundefined() as the maskedoff value.
SEW and LMUL are a part of the naming. They are static information for the intrinsics.
All of the intrinsic functions have a vl argument to specify the active vector length, except a few functions that operate regardless of vl, e.g. vmv.x.s, vfmv.f.s, vundefined, vreinterpret, vlmul_ext, vlmul_trunc, vget, vset, and vcreate.
The intrinsic functions operate on at most VLMAX elements if the vl argument is larger than VLMAX.
The semantics of the following two snippets are equivalent. We strongly recommend the first form.
size_t vl = vsetvl_e8m1 (avl);
vint8m1_t va, vb, vc;
va = vadd_vv_i8m1(vb, vc, vl);
vint8m1_t va, vb, vc;
va = vadd_vv_i8m1(vb, vc, avl);
Intrinsics are the interface to low-level assembly in a high-level programming language. The goal of the intrinsic API is to make all the V-extension instructions accessible from C/C++. The intrinsic names are as close as possible to the assembly mnemonics. Besides the basic intrinsics corresponding to assembly mnemonics, there are intrinsics closer to semantic naming.
The intrinsic names encode the return type when appropriate, so it is easier to tell the output type of an intrinsic from its name. In addition, if the intrinsic call is used as an operand, having the return type in the name is more immediate. If there is no return value, the intrinsics encode the input value types instead. If the return type is the same for different input types, exceptional rules are used to differentiate them. See Exceptions in Naming.
In general, the naming rule of intrinsics is
INTRINSIC ::= MNEMONIC '_' RET_TYPE
MNEMONIC ::= Instruction name in v-ext specification. Replace '.' with '_'.
RET_TYPE ::= SEW LMUL
SEW ::= ( i8 | i16 | i32 | i64 | u8 | u16 | u32 | u64 | f16 | f32 | f64 )
LMUL ::= ( m1 | m2 | m4 | m8 | mf2 | mf4 | mf8 )
Example:
vadd.vv vd, vs2, vs1:
vint8m1_t vadd_vv_i8m1(vint8m1_t vs2, vint8m1_t vs1, size_t vl);
vwaddu.vv vd, vs2, vs1:
vuint16m2_t vwaddu_vv_u16m2(vuint8m1_t vs2, vuint8m1_t vs1, size_t vl);
If intrinsics have the same return type for different input types, we cannot use the general naming rule directly, because it would produce the same intrinsic name for different input types.
This section lists all exceptional cases for intrinsic naming.
Vector stores do not encode a return type, because store operations return no data. Instead, the type of the stored data is used to name the intrinsics.
Example:
vse8.v vs3, (rs1):
void vse8_v_i8m1(int8_t *rs1, vint8m1_t vs3, size_t vl);
The result of vmadc and vmsbc is a mask type. Because we use the ratio SEW/LMUL to name the mask types and multiple (SEW, LMUL) pairs map to the same ratio, in addition to using the return type to name the intrinsics, we also encode the input types to distinguish these intrinsics.
Example:
vmadc.vv vd, vs2, vs1:
vbool8_t vmadc_vv_i8m1_b8(vint8m1_t vs2, vint8m1_t vs1, size_t vl);
The result of comparison instructions is a mask type. Because we use the ratio SEW/LMUL to name the mask types and multiple (SEW, LMUL) pairs map to the same ratio, in addition to using the return type to name the intrinsics, we also encode the input types to distinguish these intrinsics.
Example:
vmseq.vv vd, vs2, vs1:
vbool8_t vmseq_vv_i8m1_b8(vint8m1_t vs2, vint8m1_t vs1, size_t vl);
vbool8_t vmseq_vv_i16m2_b8(vint16m2_t vs2, vint16m2_t vs1, size_t vl);
The scalar input and output operands are held in element 0 of a single vector register. Use LMUL = 1 in the return type. To distinguish different intrinsics with different input types, encode the input type and the result type in the name.
Example:
vredsum.vs vd, vs2, vs1:
vint8m1_t vredsum_vs_i8m1_i8m1(vint8m1_t dest, vint8m1_t vs2, vint8m1_t vs1, size_t vl);
vint8m1_t vredsum_vs_i8m2_i8m1(vint8m1_t dest, vint8m2_t vs2, vint8m1_t vs1, size_t vl);
vint8m1_t vredsum_vs_i8m4_i8m1(vint8m1_t dest, vint8m4_t vs2, vint8m1_t vs1, size_t vl);
vint8m1_t vredsum_vs_i8m8_i8m1(vint8m1_t dest, vint8m8_t vs2, vint8m1_t vs1, size_t vl);
The return type of vpopc.m and vfirst.m is always an integer, so the return type is not encoded into the name. Instead, the input type is encoded.
Example:
vpopc.m rd, vs2:
unsigned long vpopc_m_b1(vbool1_t vs2, size_t vl);
unsigned long vpopc_m_b2(vbool2_t vs2, size_t vl);
To move the element 0 of a vector to a scalar, encode the input vector type and the output scalar type.
Example:
vmv.x.s rd, vs2:
int8_t vmv_x_s_i8m1_i8 (vint8m1_t vs2);
int8_t vmv_x_s_i8m2_i8 (vint8m2_t vs2);
int8_t vmv_x_s_i8m4_i8 (vint8m4_t vs2);
int8_t vmv_x_s_i8m8_i8 (vint8m8_t vs2);
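For instance, the scalar result of a reduction is read by moving element 0 of the destination vector to a scalar. A sketch, assuming the splat intrinsic vmv_v_x_i32m1 to create a zero accumulator (the reduction and scalar-move names follow the rules above):
// Sum the first vl elements of vx (a vint32m8_t) into a scalar.
vint32m8_t vx;      // input data
size_t vl;          // active vector length from vsetvl
vint32m1_t zero = vmv_v_x_i32m1(0, vl);                  // accumulator = 0
vint32m1_t red  = vredsum_vs_i32m8_i32m1(zero, vx, zero, vl);
int32_t sum     = vmv_x_s_i32m1_i32(red);                // element 0 -> scalar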
The V specification defines operations between vector and scalar operands. If XLEN > SEW, the least-significant SEW bits of the scalar register are used. If XLEN < SEW, the value from the scalar register is sign-extended to SEW bits.
We define arithmetic intrinsics with scalar using SEW types.
Example:
// Use uint8_t for op2.
vuint8m1_t vadd_vx_u8m1(vuint8m1_t op1, uint8_t op2, size_t vl);
// Use uint64_t for op2.
vuint64m1_t vadd_vx_u64m1(vuint64m1_t op1, uint64_t op2, size_t vl);
The compiler may generate multiple instructions for the intrinsics. For example, it is a little bit complicated to support uint64_t operations under XLEN = 32.
It breaks the one-to-one mapping between intrinsics and assembly mnemonics in some hardware configurations. However, it makes more sense for users to use the scalar types consistent with the SEW of vector types.
The same issue applies to vmv.x.s, vmv.s.x, vfmv.f.s, vfmv.s.f, vslide1up.vx, vfslide1up.vf, vslide1down.vx, and vfslide1down.vf. Use SEW to encode the scalar type.
Example:
// Use uint8_t for op2.
vuint8m1_t vslide1up_vx_u8m1(vuint8m1_t dest, vuint8m1_t op1, uint8_t op2, size_t vl);
RISC-V "V" extension only has "merge in output" semantic. Intrinsics with mask has two additional arguments, mask
and maskedoff
.
vd = vop(mask, maskedoff, arg1, arg2)
vd[i] = maskedoff[i], if mask[i] == 0
vd[i] = vop(arg1[i], arg2[2]), if mask[i] == 1
In general, the naming rule of intrinsics with mask (v0.t) is
INTRINSIC_WITH_MASK ::= INTRINSIC '_m'
Example:
vadd.vv vd, vs2, vs1, v0.t:
vint8m1_t vadd_vv_i8m1_m(vbool8_t mask, vint8m1_t maskedoff, vint8m1_t vs2, vint8m1_t vs1, size_t vl);
If the intrinsics are always masked, there is no need to append _m to the intrinsic. For example, the vmerge instructions are always masked.
Example:
vmerge.vvm vd, vs2, vs1, v0:
vint8m1_t vmerge_vvm_i8m1(vbool8_t mask, vint8m1_t vs2, vint8m1_t vs1, size_t vl);
vcompress.vm vd, vs2, vs1:
vint8m1_t vcompress_vm_i8m1(vbool8_t vs1, vint8m1_t maskedoff, vint8m1_t vs2, size_t vl);
There are two additional masking semantics: zero in output semantic and don't care in output semantic. Users could leverage merge in output intrinsics to simulate these two additional masking semantics.
Example:
// Don't care in output semantic: pass vundefined() as the maskedoff value.
vint8m1_t vd = vadd_vv_i8m1_m(mask, vundefined_i8m1(), vs2, vs1, vl);
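The zero-in-output semantic can be simulated the same way by passing an all-zero vector as the maskedoff value. A sketch, assuming the splat intrinsic vmv_v_x_i8m1:
// Zero in output semantic: masked-off elements become 0.
vint8m1_t zero = vmv_v_x_i8m1(0, vl);
vint8m1_t vd   = vadd_vv_i8m1_m(mask, zero, vs2, vs1, vl);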
There is no maskedoff argument for store operations. The value of maskedoff already exists in memory.
Example:
vse8.v vs3, (rs1), v0.t:
void vse8_v_i8m1_m(vbool8_t mask, int8_t *rs1, vint8m1_t vs3, size_t vl);
The result of reductions is put in element 0 of the output vector. There is no maskedoff argument for reduction operations.
Example:
vredsum.vs vd, vs2, vs1, v0.t:
vint8m1_t vredsum_vs_i8m2_i8m1_m(vbool4_t mask, vint8m1_t dest, vint8m2_t vs2, vint8m1_t vs1, size_t vl);
The result of merge operations comes from their two source operands. Merge intrinsics have no maskedoff argument.
Example:
vmerge.vvm vd, vs2, vs1, v0:
vint8m1_t vmerge_vvm_i8m1(vbool8_t mask, vint8m1_t vs2, vint8m1_t vs1, size_t vl);
vmv.s.x and reduction operations only modify the first element of the destination vector. Users can keep the original values of the remaining elements in the destination vector through the dest argument in these intrinsics.
Vector slide instructions also have unchanged parts in the destination register group. Users can keep the original values of the unchanged parts in the destination vector group through the dest argument in the intrinsics.
Example:
vint8m1_t vmv_s_x_i8m1(vint8m1_t dest, int8_t src, size_t vl);
vint8m1_t vredsum_vs_i8m1_i8m1(vint8m1_t dest, vint8m1_t vs2, vint8m1_t vs1, size_t vl);
vint8m1_t vredsum_vs_i8m2_i8m1_m(vbool4_t mask, vint8m1_t dest, vint8m2_t vs2, vint8m1_t vs1, size_t vl);
vuint8m1_t vslide1up_vx_u8m1(vuint8m1_t dest, vuint8m1_t op1, uint8_t op2, size_t vl);
SEW and LMUL are the static information for the intrinsics. The compiler will generate vsetvli when vtype is changed between operations.
Example:
vint8m1_t a, b, c, d;
vint16m2_t a2, b2, c2;
...
a2 = vwadd_vv_i16m2(a, b, vl);
b2 = vwadd_vv_i16m2(c, d, vl);
c2 = vadd_vv_i16m2(a2, b2, vl);
It will generate the following instructions.
vsetvli x0, vl, e8,m1
vwadd.vv a2, a, b
vwadd.vv b2, c, d
vsetvli x0, vl, e16,m2
vadd.vv c2, a2, b2
Be aware that when the SEW/LMUL ratio is changed, users need to ensure the vl is correct for the following operations if using implicit vl intrinsics.
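A sketch of the recommended pattern: when the SEW/LMUL ratio changes, call vsetvl again for the new configuration instead of reusing the old vl (vsetvl_e16m1 is assumed to follow the same pattern as the e8 variants listed earlier):
size_t vl8 = vsetvl_e8m1(avl);    // ratio SEW/LMUL = 8
// ... e8m1 operations using vl8 ...
size_t vl16 = vsetvl_e16m1(avl);  // ratio changes to 16: recompute vl
// ... e16m1 operations using vl16 ...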
The semantics of C built-in operators, other than simple assignment, have not been decided yet. Simple assignment keeps the usual C semantics of storing the value of the right operand into the variable named by the left operand.
This section lists utility functions that make programming with V intrinsics easier.
These utility functions are used to initialize vector values. They could be used in masking intrinsics with don't care in output semantics.
Example:
vint8m1_t vundefined_i8m1()
Note: Any operation on vundefined_*() values is undefined and unpredictable; the only recommended usage is as the maskedoff operand. For example, neither vxor(vundefined(), vundefined()) nor vec a = vundefined(); vec b = vxor(a, a); is guaranteed to produce a result vector of zeros.
These utility functions help users convert between floating-point and integer types. The reinterpret intrinsics only change the type of the underlying contents; they are no-op operations.
Example:
// Convert floating point to signed integer types.
vint64m1_t vreinterpret_v_f64m1_i64m1(vfloat64m1_t src)
// Convert floating point to unsigned integer types.
vuint64m1_t vreinterpret_v_f64m1_u64m1(vfloat64m1_t src);
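For example, reinterpreting lets users do integer bit manipulation on floating-point data, such as clearing the sign bit to compute an absolute value. A sketch, assuming the bitwise intrinsic vand_vx_u64m1 and the reverse reinterpret vreinterpret_v_u64m1_f64m1, both named following the general rules:
// |x| for each element: clear the sign bit with an integer AND.
vfloat64m1_t vx;   // input data
size_t vl;         // active vector length from vsetvl
vuint64m1_t bits  = vreinterpret_v_f64m1_u64m1(vx);
bits              = vand_vx_u64m1(bits, 0x7fffffffffffffffULL, vl);
vfloat64m1_t vabs = vreinterpret_v_u64m1_f64m1(bits);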
These utility functions help users convert between signed and unsigned types. The reinterpret intrinsics only change the type of the underlying contents; they are no-op operations.
Example:
// Convert signed to unsigned types.
vuint8m1_t vreinterpret_v_i8m1_u8m1(vint8m1_t src)
These utility functions help users convert between different SEWs under the same LMUL, e.g., convert vint32m1_t to vint64m1_t. The reinterpret intrinsics only change the type of the underlying contents; they are no-op operations. The compiler will generate a vsetvli for the new type at the following vector operation.
Example:
// Convert SEW under the same LMUL.
vint64m1_t vreinterpret_v_i32m1_i64m1(vint32m1_t src)
These utility functions help users truncate or extend the current LMUL under the same SEW, regardless of vl (they do not change the content of the vl register). For LMUL extension, the values of the extended part are undefined.
Example:
// LMUL Truncation, vlmul_trunc_v_<src_lmul>_<target_lmul>
vint64m1_t vlmul_trunc_v_i64m2_i64m1 (vint64m2_t op1);
vint64m1_t vlmul_trunc_v_i64m4_i64m1 (vint64m4_t op1);
vint64m2_t vlmul_trunc_v_i64m4_i64m2 (vint64m4_t op1);
// LMUL Extension, vlmul_ext_v_<src_lmul>_<target_lmul>
vint64m2_t vlmul_ext_v_i64m1_i64m2 (vint64m1_t op1);
vint64m4_t vlmul_ext_v_i64m1_i64m4 (vint64m1_t op1);
vint64m8_t vlmul_ext_v_i64m1_i64m8 (vint64m1_t op1);
These utility functions help users insert a vector with a smaller LMUL into, or extract one from, a vector with a larger LMUL under the same SEW.
Example:
// Insert a vector with a smaller LMUL, vset_v_<src_lmul>_<target_lmul>
vint32m2_t vset_v_i32m1_i32m2 (vint32m2_t dest, size_t index, vint32m1_t val);
vint32m4_t vset_v_i32m1_i32m4 (vint32m4_t dest, size_t index, vint32m1_t val);
vint32m4_t vset_v_i32m2_i32m4 (vint32m4_t dest, size_t index, vint32m2_t val);
// Extract a vector with a smaller LMUL, vget_v_<src_lmul>_<target_lmul>
vint32m1_t vget_v_i32m2_i32m1 (vint32m2_t src, size_t index);
vint32m1_t vget_v_i32m4_i32m1 (vint32m4_t src, size_t index);
vint32m2_t vget_v_i32m4_i32m2 (vint32m4_t src, size_t index);
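For example, vget can split a register group into its LMUL=1 halves, a common step before a final reduction. A sketch using the signatures above plus vadd_vv_i32m1:
// Add the two m1 halves of an m2 register group.
vint32m2_t v2;     // input register group
size_t vl;         // active vector length for the m1 operation
vint32m1_t lo  = vget_v_i32m2_i32m1(v2, 0);
vint32m1_t hi  = vget_v_i32m2_i32m1(v2, 1);
vint32m1_t sum = vadd_vv_i32m1(lo, hi, vl);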
These utility functions help users to create segment load/store types and insert/extract elements to/from segment load/store types.
Example:
// Create segment load/store types.
vint32m2x2_t vcreate_i32m2x2(vint32m2_t val1, vint32m2_t val2)
vint32m2x3_t vcreate_i32m2x3(vint32m2_t val1, vint32m2_t val2, vint32m2_t val3)
// Insert an element to segment load/store types.
vint32m2x2_t vset_i32m2x2(vint32m2x2_t tuple, size_t index, vint32m2_t value)
// Extract an element from segment load/store types.
vint32m2_t vget_i32m2x2_i32m2(vint32m2x2_t tuple, size_t index)
vint32m2_t vget_i32m2x3_i32m2(vint32m2x3_t tuple, size_t index)
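A sketch of a round trip through a tuple type using only the functions above:
vint32m2_t v0, v1;
// Pack two m2 vectors into a two-field tuple, update field 0, then read field 1 back.
vint32m2x2_t tup  = vcreate_i32m2x2(v0, v1);
tup               = vset_i32m2x2(tup, 0, v1);
vint32m2_t field1 = vget_i32m2x2_i32m2(tup, 1);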
The overloaded interface has shorter function names and supports a smaller number of intrinsic functions. Overloaded names keep the full function name with the type suffix removed.
Compilers may support the overloaded interface optionally. The preprocessor macro __riscv_v_intrinsic_overloading is defined when the overloaded interface is available.
Example:
vadd.vv, vadd.vx, and vadd.vi share a unified interface vadd().
vint8m1_t vadd(vint8m1_t op1, vint8m1_t op2, size_t vl);
// The compiler will choose the following intrinsic
vint8m1_t vadd_vv_i8m1(vint8m1_t op1, vint8m1_t op2, size_t vl);
vint8m2_t vadd(vint8m2_t op1, vint8m2_t op2, size_t vl);
// The compiler will choose the following intrinsic
vint8m2_t vadd_vv_i8m2(vint8m2_t op1, vint8m2_t op2, size_t vl);
vint8m1_t vadd(vint8m1_t op1, int8_t op2, size_t vl);
// The compiler will choose the following intrinsic
vint8m1_t vadd_vx_i8m1(vint8m1_t op1, int8_t op2, size_t vl);
vint8mf8_t vadd(vbool64_t mask, vint8mf8_t maskedoff, vint8mf8_t op1, vint8mf8_t op2, size_t vl);
// The compiler will choose the following intrinsic
vint8mf8_t vadd_vv_i8mf8_m (vbool64_t mask, vint8mf8_t maskedoff, vint8mf8_t op1, vint8mf8_t op2, size_t vl);
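Portable code can key off the preprocessor macro when the overloaded interface may not be available. A sketch:
#ifdef __riscv_v_intrinsic_overloading
  vint8m1_t vc = vadd(va, vb, vl);           // overloaded form
#else
  vint8m1_t vc = vadd_vv_i8m1(va, vb, vl);   // explicit form
#endif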
The unsupported overloaded functions fall into the following categories, based on the types of their input/return arguments:
- The input arguments are scalar types only (non-masked vle/vlse, etc.).
- There are no input arguments (vmclr.m/vmset.m/vid.v).
- The input is a boolean vector and the return type is a non-boolean vector (viota).
Keep the output type in the function names.
Example:
// u8 for uint8
vuint8m1_t vreinterpret_u8 (vint8m1_t src);
// x for scalar, xu for unsigned scalar, and f for float.
vint16m1_t vfcvt_x (vfloat16m1_t src, size_t vl);
The scalar type promotion is not obvious when an instruction supports source vector types of both 2*SEW and SEW, so the vx/wx/vf/wf suffix is appended to the function names (e.g. vwadd[u].vx/vwadd[u].wx, vwsub[u].vx/vwsub[u].wx, vfwadd.vf/vfwadd.wf, and vfwsub.vf/vfwsub.wf).
For consistency, vector-vector operations also append a suffix to the function name.
// Example: users would need to specify an explicit type for op2 if the functions below had the same name.
vuint32mf2_t vwaddu_wx(vuint32mf2_t op1, uint16_t op2, size_t vl);
vuint64m1_t vwaddu_vx(vuint32mf2_t op1, uint32_t op2, size_t vl);
The original overloaded names were confusing, so one more suffix is appended for readability. For example:
// Old.
vint8m1_t vmv (vint8m1_t src, size_t vl); // vmv.v.v
int8_t vmv (vint8m1_t src); // vmv.x.s
vint8m1_t vmv (vint8m1_t dst, int8_t src, size_t vl); // vmv.s.x
float16_t vfmv (vfloat16m1_t src); // vfmv.f.s
vfloat16m1_t vfmv (vfloat16m1_t dst, float16_t src, size_t vl); // vfmv.s.f
// New.
vint8m1_t vmv_v (vint8m1_t src, size_t vl); // vmv.v.v
int8_t vmv_x (vint8m1_t src); // vmv.x.s
vint8m1_t vmv_s (vint8m1_t dst, int8_t src, size_t vl); // vmv.s.x
float16_t vfmv_f (vfloat16m1_t src); // vfmv.f.s
vfloat16m1_t vfmv_s (vfloat16m1_t dst, float16_t src, size_t vl); // vfmv.s.f
The compiler should guarantee the correctness of the vtype setting after a vsetvl instruction. For example, consider the widening multiply-accumulate example below.
vl = vsetvl_e16m4(n);
vfloat16m4_t vx = vle16_v_f16m4(ptr_x, vl);
// vsetvl_e32m8(vl); // No need to keep the same vl and change vtype manually
vfloat32m8_t vy = vle32_v_f32m8(ptr_y, vl);
// vsetvl_e16m4(vl); // No need to keep the same vl and change vtype manually
vy = vfwmacc_vf_f32m8(vy, 2.0, vx, vl);
// vsetvl_e32m8(vl); // No need to keep the same vl and change vtype manually
vse32_v_f32m8(ptr_y, vy, vl);
This example has a vl computed from vsetvl_e16m4, and the type changes to vfloat32m8_t in the middle. With the compiler's help, users do not need to change vtype manually, because vfloat16m4_t and vfloat32m8_t have exactly the same number of elements (the same SEW/LMUL ratio).
Note that when intrinsic functions with a new SEW/LMUL ratio are used after the vsetvl instruction, an illegal-instruction exception will be raised.
The V extension spec mentions that strided load/store instructions with a stride of 0 may perform all memory accesses or fewer memory accesses. Since needing all memory accesses is unlikely to be common, the compiler implementation is allowed to generate fewer memory operations for strided load/store intrinsics. In other words, the compiler does not guarantee generating the all-memory-accesses form for strided load/store intrinsics with a stride of 0. If users need all memory accesses to be performed, they should use indexed load/store intrinsics with all-zero indices.
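A sketch of the suggested workaround, using a splat of zero as the index vector; the indexed-load name vloxei32_v_i32m1 and the splat vmv_v_x_u32m1 are assumptions here (they are not defined in this document):
// Force one memory access per element at the same address instead of a
// stride-0 strided load.
const int32_t *base;  // scalar address to read repeatedly
size_t vl;            // active vector length from vsetvl
vuint32m1_t zero_idx = vmv_v_x_u32m1(0, vl);
vint32m1_t v = vloxei32_v_i32m1(base, zero_idx, vl);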