Skip to content

Latest commit

 

History

History
401 lines (332 loc) · 20.2 KB

p1191.md

File metadata and controls

401 lines (332 loc) · 20.2 KB

Bitwise and shift operators

Pull request

Table of contents

Problem

Carbon needs operations for working with bit representations of values.

Background

C++ provides four bitwise operations for Boolean algebra: complement (~), and (&), or (|), and xor (^). These are all useful in bit-manipulation code (although ^ is used substantially less than the others). In addition, C++ provides two bit-shift operators << and >> that can perform three different operations: left shift, arithmetic right shift, and logical right shift. The meaning of >> is determined by the signedness of the first operand.

C and Swift use the same set of operators as C++. Rust uses most of the same operators, but uses ! instead of ~ for complement, unifying it with the logical not operator, which is spelled not in Carbon and as ! in Rust and C++. Go uses most of the same operators as C++, but uses unary prefix ^ instead of ~ for complement, mirroring binary ^ for xor.

In addition to the operators provided by C and C++, bit-rotate operators are present in some languages, and a short notation for them may be convenient. Finally, there are other non-degenerate binary Boolean operations with no common operator symbol:

  • The "implies" operator (equivalent to ~a | b).
  • The "implied by" operator (equivalent to a | ~b).
  • The complement of each of the other operators (NAND, NOR, XNOR, "does not imply", "is not implied by").

Overflow in shift operators

The behavior of shift operators in C++ had a turbulent past. The behavior of shift operators has always been undefined if the right operand is out of range -- not between zero inclusive and the bit-width of the left operator exclusive -- but other restrictions have varied:

  • Unsigned left shift has never had any restrictions on the first operand.
  • For signed left shift:
    • In C++98, the result was fully unspecified.
    • In C++11, the result was specified only if the first operand was non-negative and the result fit into the destination type -- that is, if no 1 bit is shifted into the sign bit.
    • In C++14, the result was specified only if the first operand was non-negative and the result fit into the unsigned type corresponding to the destination type -- that is, if no 1 bit is shifted out of the sign bit.
    • In C++20 onwards, there are no restrictions beyond a range restriction on the right operand, and the result is otherwise always specified, even if the left operand is negative.
  • Unsigned right shift has never had any restrictions on the first operand.
  • For signed right shift:
    • In C++17 and earlier, if the left operand is negative, the result is implementation-defined.
    • In C++20 onwards, the result is always specified, even if the left operand is negative.

There is a clear trend towards defining more cases, following two's complement rules.

Proposal

Use the same operator set as C++, but replace ~ with unary prefix ^.

Define the behavior for all cases of << and >> except where the right operand is either negative or is greater than or equal to the bit-width of the left operand.

Details

See changes to the design, and in particular the new section on bitwise and shift operators.

Rationale

  • Performance-critical software
    • Bit operations are important low-level primitives. Providing operators for them is important in order to allow low-level high-performance code to be written elegantly in Carbon.
    • By not defining semantics for << and >> when the right-hand operand is out of range, we can directly use hardware instructions for these operations whose behavior in these cases vary between architectures.
  • Code that is easy to read, understand, and write
    • Using operator notation rather than function call notation for bitwise operators improves readability in code making heavy use of these operations.
  • Practical safety and testing mechanisms
    • Carbon follows C++ in treating << and >> as programming errors when the right hand operand is out of range, but Carbon guarantees that such errors will not directly result in unbounded misbehavior in hardened builds.
  • Modern OS platforms, hardware architectures, and environments
    • All hardware architectures we care to support are natively two's complement architectures, and that assumption allows us to define the semantics of shift operators in the way that makes the most sense for such architectures.
    • Our bitwise operations make no assumptions about the endianness of the hardware architecture, although the shift operators make the most sense on a little-endian or big-endian architecture, which are the only endiannesses we expect to see in modern hardware platforms.
  • Interoperability with and migration from existing C++ code
    • The same set of operators is provided as in C++, making it easy for programmers and programs to migrate, with the exception that the ~ operator is mechanically replaced with ^. This change is expected to be easy for programmers to accommodate, especially given that Rust's choice to replace ~ with ! does not seem to be a source of sustained complaints.
    • The extensibility support reflects the full scope of operator overloading in C++, permitting separate overloading of each of the bitwise operators with custom return types. This should allow smooth interoperability with C++ overloaded operators.

Alternatives considered

Use different symbols for bitwise operators

The operator syntax for bitwise operators was decided in #545. Some of the specific alternatives considered are discussed below.

Use a multi-character spelling

We considered using various multi-character spellings for the bitwise and, or, xor, and complement operators:

  • &:, |:, ^:, ~:
  • .&., .|., .^., .~.
  • .*., .+., .!=., .!.
  • /\, \/, (+), -|
  • bitand, bitor, bitxor, compl

The advantage of switching to such a set of operators is that this would free up the single-character &, |, ^, and ~ tokens for other uses that may occur more frequently in Carbon programs. We have some candidate uses for these operators already:

  • & is used for combining interfaces and as a unary address-of operator.
  • | may be useful for sum types, for alternatives in patterns, or as another kind of bracket as in Ruby's lambda notation.
  • ~ may be useful as a destructive move notation.
  • ^ may be useful as a postfix pointer dereference operator.

Other motivations for switching to a different set of spellings include:

  • There are some concerns that < and << are visually similar, analogous to & and &&.
  • Carbon has moved away from && and other punctuation based logical operators and towards keywords like and. Bitwise operators could similarly switch to keywords like bitand.

However, moving substantially away from the C++ operator set comes with a set of concerns:

  • There are strong established expectations and intuitions about these operators and their spellings among C++ practitioners.
  • In some of the code that uses these operators, they are used a lot, and a more cumbersome operator may consequently cause an outsized detriment on readability.
  • These operations are used particularly in the area of low-level, high-performance code, which is an area for which we want Carbon to be especially appealing. Using short operators for these operations demonstrates our commitment to providing good support for such code.
  • Even if we didn't use these operators as bit operators, we would still want to exercise caution when using them for some other purpose to avoid surprising C++ developers.
  • While some visual similarity exists such as between < and <<, the contexts in which these operators are used are sufficiently different to avoid any serious concerns.
  • The primary motivation of using and instead of && doesn't apply for bitwise operators: the logical operator represents control flow. Separating logical and bitwise "and" operations more visibly seems especially important because of this control flow semantic difference. Without any control flow and with the keywords being significantly longer for bitwise operations, the above considerations were the dominant ones that led us to stick with familiar & and | spellings.

Don't provide an xor operator

We considered omitting the ^ operator, providing this functionality in some other way, such as a named function or an xor keyword, while keeping the & and | symbols for bitwise operations. We could take a similar approach for the complement operation, such as by using a compl keyword. The primary motivation was to avoid spending two precious operator characters on relatively uncommon operations. However, we did not want to apply the same change to & and |, and it seemed important to maintain consistency between the three binary bitwise operators from C++.

Using ^ for both operations provides some of the benefits here, allowing us to reclaim ~, without introducing the inconsistency that would result from using keywords.

Use ~, or some other symbol, for complement

We could follow C++ and use ~ as the complement operator. However, using ~ for this purpose spends a precious operator character on a relatively uncommon operation, and ~ is often visually confusible with -, creating the potential for readability problems. Also, in C++, ~ is overloaded to also refer to destruction, and we may want to use the same notation for destruction or destructive move semantics in Carbon.

We found ^ to be a satisfying alternative with a good rationale and mnemonic: ^ is a bit-flipping operator -- a ^ b flips the bits in a that are specified in b -- and complement is an operator that flips all the bits. ^a is equivalent to a ^ n, where n is the all-one-bits value in the type of a.

We also considered using ! for complement, like Rust does. Unlike in Rust, this would not be a generalization of ! on bool, because we use not for bool negation, and repurposing ! in this way compared to C++ seemed confusing.

Provide different operators for arithmetic and logical shifts

We could provide different operators for arithmetic right shift and logical right shift. This might allow programmers to better express their intent. However, it seems unnecessary, as using the type of the left operand is a strategy that doesn't appear to have caused significant problems in practice in the languages that have followed it.

Basing the kind of shift on the signedness of the left operand also follows from viewing a negative number as having an infinite number of leading 1 bits, which is the underlying mathematical model behind the two's complement representation.

Provide rotate operators

We could provide bitwise rotation operators. However, there doesn't seem to be a sufficient need to justify adding another operator symbol for this purpose.

Guarantee behavior of large shifts

Logically, the behavior of shifts is meaningful for all values of the second operand:

  • A shift by an amount greater than or equal to the bit-width of the first operand will shift out all of the original bits, producing a result where all value bits are the same.
  • A shift in one direction by a negative amount is treated as a shift in the opposite direction by the negation of that amount.

Put another way, we can view the bits of the first operand as an N-bit window into an infinite sequence of bits, with infinitely many leading sign bits (all zeroes for an unsigned value) and infinitely many trailing zero bits after a notional binary point, and a shift moves that window around. Or equivalently, a shift is always a multiplication by 2N followed by rounding and wrapping.

We could provide the correct result for all shifts, regardless of the magnitude of the second operand. This is the approach taken by Python, except that Python rejects negative shift counts. The primary reason we do not do this is lack of hardware support. For example, x86 does not have an instruction to perform this operation. Rather, x86 provides shift instructions that mask off all bits of the second operand except for the bottom 5 or 6, meaning that a left shift of a 64-bit operand by 64 will return the operand unchanged rather than producing zero.

We could instead provide x86-like behavior, guaranteeing to consider only the lowest N bits of the second operand when the first operand is an iN or uN. This is the approach taken by Java for its 32-bit int type and 64-bit long type, where the second operand is taken modulo 32 or 64, respectively, and in JavaScript, where operands of bitwise and bit-shift operators are treated as 32-bit integers and the second operand of a shift is taken modulo 32. This approach would provide an operation that can be implemented by a single instruction on x86 platforms when N is 32 or 64, and for all smaller types and for all other platforms the operation can be implemented with two instructions: a mask and a shift. For larger types, single-instruction support may not be available, but nonetheless the performance will be close to optimal, requiring at most one additional mask. There is still some performance cost in some cases, but the primary reason we do not do this is the same reason we choose to not define signed integer overflow: this masked result is unlikely to be the value that the developer actually wanted.

Instead of the above options, Carbon treats a second operand that is not in the interval [0, N) as a programming error, just like signed integer overflow:

  • Debugging builds can detect and report this error without the risk of false positives.
  • Performance builds can optimize on the basis that this situation will not occur, and can in particular use the dedicated x86 instructions that ignore the high order bits of the second operand.
  • Optimized builds guarantee that either the programming error results in program termination or that some value is produced, and moreover that said value is the result of applying some mathematical shift to the input. For example, it's valid for an i32 shift to be implemented by an x86 64-bit shift that will produce 0 if the second operand is in [32, 63) but that will treat a second operand of, say, 64 or -64 the same as 0.

Support shifting a constant by a variable

We considered various ways to support

var a: i32 = ...;
var b: i32 = 1 << a;
var c: i32 = 1234 >> a;

with no explicit type specified for the first operand of a bit-shift operator. We considered the following options:

  • Use the type of the second operand as the result type. This would be surprising, because the type of the second operand doesn't otherwise influence the result type of a built-in bit-shift operator.
  • Use some suitable integer type that can fit the first operand. However, this is unlikely to do the right thing for a left-shift, and will frequently pick either a type that's too large, resulting in the program being rejected due to narrowing, or a type that's too small, resulting in a program that has undefined behavior due to the second operand being too large. We could apply this approach only for right shifts, but it was deemed too inconsistent to use different rules for left and right shifts.
  • We could find a way to defer picking the type in which the operation is performed until later. For example, we could treat 1 << a as a value of a new type that carries its left-hand operand as a type parameter and its right-hand operand as runtime state, and allow that type to be converted in the same way as its integer constant. However, this would introduce substantial complexity: reasonable and expected uses such as
    var mask: u32 = (1 << a) - 1;
    
    would require a second new type for a shifted value plus an offset, and general support would require a facility analogous to expression templates. Further, this facility would allow implicit conversions that notionally overflow, such as would happen in the above example when a is greater than 32.

In the absence of a good approach, we disallow such conversions for now. The above example can be written as:

var a: i32 = ...;
var b: i32 = (1 as i32) << a;
var c: i32 = (1234 as i32) >> a;

Converting complements to unsigned types

We view an integer constant has having infinitely many high-order sign bits followed by some number of lower-order value bits. As a consequence, the complement of a positive integer constant is negative. As a result, some important forms of initialization use a negative integer constant initializer for an unsigned type:

// Initializer here is the integer value -8.
var mask: u32 = ^7;

We considered some options for handling this:

  • We could allow negative integer constants to convert to unsigned types if doing so only discards sign bits. This violates the "semantics-preserving" rule for implicit conversions.
  • We could change our model of integer constants to distinguish between "masks" -- numbers with infinitely many 1 bits preceding the value bits that are nonetheless not considered to be negative. This was considered to introduce too much complexity.
  • We could allow conversions to unsigned types from signed types and negative constants in general, or at least in cases where the signed operand is no wider than the unsigned type, and perform wrapping. The latter option seems plausible, but we don't have sufficient motivation for it, and were worried about a risk of bugs from allowing an implicit conversion at runtime that converts a negative value to an unsigned type.
  • We could reject such initializations, with an explicit conversion required to convert such values to unsigned types. This seems to present unacceptable ergonomics for code performing bit-manipulation.

On balance, our preferred option was to permit implicit conversions from negative literals to unsigned types so long as we only discard sign bits.