significantly cleanup and flesh out page on UB #158
base: master
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -13,21 +13,24 @@ any of these things will cause the ever dreaded Undefined Behavior. Invoking | |
Undefined Behavior gives the compiler full rights to do arbitrarily bad things | ||
to your program. You definitely *should not* invoke Undefined Behavior. | ||
|
||
|
||
|
||
## Fundamental Undefined Behaviour | ||
|
||
Unlike C, Undefined Behavior is pretty limited in scope in Rust. All the core | ||
language cares about is preventing the following things: | ||
|
||
* Dereferencing (using the `*` operator on) dangling, or unaligned pointers, or | ||
wide pointers with invalid metadata (see below) | ||
* Dereferencing (using the `*` operator on) a raw pointer that is dangling, unaligned, or that has invalid metadata (if wide; see references below) | ||
* Breaking the [pointer aliasing rules][] | ||
* Unwinding into another language | ||
* Unwinding out of a function that doesn't have a rust-native [calling convention][] | ||
* Causing a [data race][race] | ||
* Executing code compiled with [target features][] that the current thread of execution does | ||
not support | ||
* Producing invalid values (either alone or as a field of a compound type such | ||
as `enum`/`struct`/array/tuple): | ||
* a `bool` that isn't 0 or 1 | ||
* an `enum` with an invalid discriminant | ||
* a null `fn` pointer | ||
* a `fn` pointer that is null | ||
* a `char` outside the ranges [0x0, 0xD7FF] and [0xE000, 0x10FFFF] | ||
* a `!` (all values are invalid for this type) | ||
* a reference that is dangling, unaligned, points to an invalid value, or | ||
|
@@ -37,14 +40,10 @@ language cares about is preventing the following things: | |
* `dyn Trait` metadata is invalid if it is not a pointer to a vtable for | ||
`Trait` that matches the actual dynamic trait the reference points to | ||
* a `str` that isn't valid UTF-8 | ||
* an integer (`i*`/`u*`), floating point value (`f*`), or raw pointer read from | ||
[uninitialized memory][] | ||
* a non-padding byte that is [uninitialized memory][] (see discussion below) | ||
* a type with custom invalid values that is one of those values, such as a | ||
`NonNull` that is null. (Requesting custom invalid values is an unstable | ||
feature, but some stable libstd types, like `NonNull`, make use of it.) | ||
|
||
"Producing" a value happens any time a value is assigned, passed to a | ||
function/primitive operation or returned from a function/primitive operation. | ||
feature, but some stable stdlib types, like `NonNull`, make use of it.) | ||
Review comment: "stdlib" is AFAIK not a term we use anywhere. If you don't like "libstd" (because it seems to exclude libcore), what about "standard library"?

Reply: Genuinely shocked to learn that this is the only occurrence of either string in the nomicon, with "std" only being used in the title of "beneath std". Cool with keeping your version, completely thought I was just homogenizing the word with the rest of the book.

Reply: "libstd" is likely a bad choice though, as some people read that to exclude libcore and liballoc. So, I'd go for "standard library".

Reply: eh, sure, I guess
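Several of the invalid values listed in the diff above have checked constructors in the standard library, so safe code never has to produce them. The following is a minimal sketch, not part of the PR; the `checked_*` helper names are hypothetical and only wrap existing standard library calls (`char::from_u32`, `NonNull::new`, `std::str::from_utf8`).

```rust
use std::ptr::NonNull;

/// Hypothetical helper: returns `Some(c)` only if `raw` is a Unicode scalar
/// value, i.e. inside [0x0, 0xD7FF] or [0xE000, 0x10FFFF].
fn checked_char(raw: u32) -> Option<char> {
    char::from_u32(raw)
}

/// Hypothetical helper: only the bytes 0 and 1 are valid `bool`s.
fn checked_bool(raw: u8) -> Option<bool> {
    match raw {
        0 => Some(false),
        1 => Some(true),
        _ => None, // transmuting this byte to `bool` would be UB
    }
}

/// Hypothetical helper: returns `None` for a null pointer instead of
/// producing an invalid `NonNull`.
fn checked_non_null(ptr: *mut u8) -> Option<NonNull<u8>> {
    NonNull::new(ptr)
}

/// Hypothetical helper: returns `None` if `bytes` is not valid UTF-8.
fn checked_str(bytes: &[u8]) -> Option<&str> {
    std::str::from_utf8(bytes).ok()
}
```

Each helper rejects the bit patterns that the list calls invalid, so the caller gets `None` rather than an invalid value.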
||
|
||
A reference/pointer is "dangling" if it is null or not all of the bytes it | ||
points to are part of the same allocation (so in particular they all have to be | ||
|
@@ -54,18 +53,97 @@ empty, "dangling" is the same as "non-null". Note that slices point to their | |
entire range, so it's very important that the length metadata is never too | ||
large. If for some reason this is too cumbersome, consider using raw pointers. | ||
|
||
That's it. That's all the causes of Undefined Behavior baked into Rust. Of | ||
course, unsafe functions and traits are free to declare arbitrary other | ||
constraints that a program must maintain to avoid Undefined Behavior. For | ||
instance, the allocator APIs declare that deallocating unallocated memory is | ||
Undefined Behavior. | ||
|
||
However, violations of these constraints generally will just transitively lead to one of | ||
the above problems. Some additional constraints may also derive from compiler | ||
intrinsics that make special assumptions about how code can be optimized. For instance, | ||
Vec and Box make use of intrinsics that require their pointers to be non-null at all times. | ||
|
||
Rust is otherwise quite permissive with respect to other dubious operations. | ||
## Invalid Values: Yes We Mean It | ||
|
||
Many have trouble accepting the consequences of invalid values, so they merit | ||
some extra discussion here so no one misses it. The claim being made here is a | ||
very strong and surprising one, so read carefully. | ||
|
||
A value is *produced* whenever it is assigned, passed to something, or returned | ||
from something. Keep in mind references get to assume their referents are valid, | ||
so you can't even create a reference to an invalid value. | ||
|
||
Additionally, [uninitialized memory][] is **always invalid**, so you can't assign it to | ||
Review comment: See above, this is incorrect in its generality. You are restricting this later, but I think that's too late. Also, first saying one thing and then later "well, but we didn't really mean it" is really confusing.

Reply: This is a fair criticism. I wanted to try this approach out, but it's definitely worth rethinking. Basically I kinda like this approach because, again, our primary interest is in preventing programmers from doing bad things. So "you can't do this" followed by "...except for here" isn't a terrible approach with that goal. If people bounce off, they come away with a hyper-conservative model and try to avoid messing with uninitialized memory, which is good imo. If people see that and get confused/angry, they can keep reading and go "aha!".

Reply: In my experience (which is mostly based on writing scientific papers), one has to be very careful when leading with a wrong statement and correcting that later. This is definitely sometimes a good strategy, but it's a double-edged sword when you also consider e.g. your reader's faith in you. Usually, the least I'd do in a case like this is to add a footnote (would have to be a parenthetical here because footnotes are rendered too far away) saying something like "this is not strictly correct, we will refine this statement later". This at least prepares the reader for the blow coming later. That said, in this case, I disagree with the entire approach. I think it is wrong to call out uninitialized memory as anything special.

The fact that an uninitialized

On top of that, we should mentally prepare for when rust-lang/unsafe-code-guidelines#71 gets resolved. The likely resolution (and IMO the best one) is that we will declare uninitialized integers as not being UB. Only when an uninitialized integer is fed into a primitive operation (like

Reply: I was under the impression we were very far from resolving the uninit integer question satisfactorily. If that's not the case, I agree we should focus on a tighter definition.

Reply: I won't dare making predictions. But the person arguing most strongly against allowing uninit integers was me, and I changed my mind. gnzlbg went back and forth, not sure what their current stance is. I don't actually know any argument against allowing uninit integers, besides the few that I opened the thread with, and I don't consider them convincing enough (any more) given the benefits of allowing uninit integers (mostly, everyone does it anyway^^). If I had infinite time, the RFC would already have been written. I consider this the least controversial amongst the open questions around validity that we have. I might be missing a controversy though.
||
anything, pass it to anything, return it from anything, or take a reference to it. | ||
Padding bytes aren't technically part of a value's memory, and so may be left | ||
uninitialized. For unions, this includes the padding bytes of *all* variants, | ||
as unlike enums, unions are never definitely set to any particular variant (Rust | ||
does not have the C++ notion of an "active member"). This makes unions | |
the preferred mechanism for working directly with uninitialized memory (see | |
[MaybeUninit][] for details). | ||
|
||
In simple and blunt terms: you cannot ever even *suggest* the existence of an | ||
invalid value. No, it's not ok if you "don't use" or "don't read" the value. | ||
Invalid values are **instant Undefined Behaviour**. The only correct way to | ||
||
manipulate memory that could be invalid is with raw pointers using methods | ||
like write and copy. If you want to leave a local variable or struct field | ||
Review comment: These could be links to the method docs.
||
uninitialized (or otherwise invalid), you must use a union (like MaybeUninit) | ||
or enum (like Option) which clearly indicates at the type level that this | ||
memory may not be part of any value. | ||
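The rule in the paragraph above can be sketched with `MaybeUninit`. This is an illustrative example rather than text from the PR; the function name `squares` is hypothetical. The buffer has type `[MaybeUninit<u32>; 4]`, so the uninitialized bytes are never claimed to be `u32` values until every slot has been written.

```rust
use std::mem::MaybeUninit;

/// Hypothetical example: builds `[0, 1, 4, 9]` without ever producing an
/// uninitialized `u32`.
fn squares() -> [u32; 4] {
    // The union type says at the type level that these slots may not hold
    // a value yet. `MaybeUninit<u32>` is `Copy`, so array-repeat syntax works.
    let mut buf: [MaybeUninit<u32>; 4] = [MaybeUninit::uninit(); 4];

    for (i, slot) in buf.iter_mut().enumerate() {
        // `write` stores a value without reading or dropping the old contents.
        slot.write((i * i) as u32);
    }

    // SAFETY: every slot was written in the loop above, so each
    // `assume_init` produces a valid `u32`.
    buf.map(|slot| unsafe { slot.assume_init() })
}
```

Calling `assume_init` on a slot that was never written would produce an invalid value, which is exactly the case the paragraph warns about.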
|
||
|
||
|
||
|
||
## Other Sources of Undefined Behavior | ||
|
||
That's it. That's all the causes of Undefined Behavior baked into Rust. | ||
|
||
Well, ok, only sort of. | ||
|
||
While it's true that the language itself doesn't define that much Undefined | ||
Review comment: This is not really true either because intrinsics are part of the language spec. That's what separates them from normal standard library functions.

Reply: I'm not sure I agree. Intrinsics are part of the compiler+stdlib fused implementation. It is not, to my knowledge, expected that you can hot-swap any stdlib implementation with any compiler implementation. Certainly this is not the case for C++.

Reply: Intrinsics are definitely considered T-lang jurisdiction in Rust. They also have to be defined as primitive operations in the Rust Abstract Machine. The only difference between intrinsics and e.g. MIR binops is their syntax and data structure representation, really. Both

I don't see any way to treat intrinsics as not being part of the language. Sure, they can be compiler-specific language extensions, but that still makes them very different from library functions.

Reply: I would be quite surprised and disappointed if writing a correct Rust program ever required a programmer to meaningfully distinguish a function call from "an operation in the Rust abstract machine".

Reply: But anyway, the important point here is that intrinsic UB only shows up when you use the intrinsic. I'll think about tweaking this line.

Reply: I don't understand what you mean here. Most function calls, say to

So, much like

Put differently, if you are saying intrinsics are not "part of the language", then why do you call

Indeed most of our primitive operations do not have UB. Pointer deref (
||
Behavior, libraries may use unsafe functions and unsafe traits to define | ||
their own contracts with Undefined Behavior at stake. For instance, the raw | ||
allocator APIs declare that you aren't allowed to deallocate unallocated memory, | ||
and the Send trait declares that implementors must in fact be safe to move to | ||
another thread. | ||
|
||
Usually these constraints are in place because violating them will lead to one | ||
of Rust's Fundamental Undefined Behaviors, but that doesn't have to be the case. | ||
In particular, several standard library APIs are actually thin wrappers around | ||
*intrinsics* which tell the compiler it can make certain assumptions. | ||
|
||
It's useful to distinguish between these "intrinsic" sources of UB and | ||
the fundamental ones because the intrinsic ones *don't matter* unless someone | ||
actually invokes the relevant functions. The fundamental ones, on the other hand, | ||
are ever-present. | ||
|
||
With that said, some intrinsics, like the surprisingly strict [`ptr::offset`][], | ||
are *pretty* close to fundamental. 😅 | ||
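As a rough illustration of why `ptr::offset` is described as strict: its documented safety contract requires the result to stay within (or one past the end of) the same allocation, while `wrapping_add` has no such requirement. The sketch below is not from the PR, and the function names `sum_via_raw` and `far_past_end` are hypothetical.

```rust
/// Hypothetical example: sums a slice by walking a raw pointer across it.
fn sum_via_raw(data: &[i32]) -> i32 {
    let start = data.as_ptr();
    let mut total = 0;
    for i in 0..data.len() {
        // SAFETY: `i < data.len()`, so `start.add(i)` stays inside the
        // allocation behind `data`, as `offset`/`add` require.
        total += unsafe { *start.add(i) };
    }
    total
}

/// Hypothetical example: computes an address far past the end of the slice.
/// `wrapping_add` has no in-bounds requirement, so computing the pointer is
/// fine; dereferencing it would still be Undefined Behavior.
fn far_past_end(data: &[i32]) -> *const i32 {
    data.as_ptr().wrapping_add(data.len() + 1000)
}
```

Replacing `wrapping_add` with `add` in `far_past_end` would violate the contract of `add` even though the resulting pointer is never dereferenced.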
|
||
|
||
|
||
## Not Technically Fundamental Undefined Behavior | ||
Review comment: I wonder if it is useful to introduce the terminology of "language-level UB" and "library-level UB" here?

Reply: IMO the interesting thing to define here is the notion of "things safe code can't ever be allowed to do, but isn't actually UB (but will almost certainly lead to UB)"

Reply: I say this because this is specifically a page about the distinction between safe/unsafe.

Reply: That would basically be what I call the "safety invariant".
||
|
||
There are a few things in Rust that aren't *technically* Fundamental Undefined Behavior, | ||
but which library authors can implicitly assume don't happen, with Undefined | ||
Behavior at stake. As such, it should be impossible to do these things in safe | ||
code, as they can very easily lead to Undefined Behavior. | ||
|
||
This section is non-exhaustive, although that may change in the future. | ||
|
||
It is *technically not* Undefined Behavior to run a value's destructor twice. | ||
Authors of destructors may however assume this doesn't happen. For instance, if | ||
you drop a Box twice it will almost certainly result in Undefined Behavior. | ||
Review comment: No need to hedge, this is a double-free.

Reply: I am hedging here because technically I think it's not UB if someone manages to reallocate the pointer before you run the second drop. You just freed their allocation, and they're probably going to Do An UB, but not necessarily. Or is there some fancy compiler-knows-about-malloc shenanigans where this is still UB because the compiler "knows" you're not allowed to get lucky and free someone else's allocation like that?

Reply: Pointer provenance rules (even the basic C-style provenance described in the UCG glossary) mean that even if the physical address gets reused, you would not be allowed to use old pointers to access the new allocation.

Reply: sweet
||
Technically someone *could* explicitly support double-dropping their type, although | ||
it's hard to say why. | ||
|
||
It is *technically not* Undefined Behavior to reinterpret a bunch of | ||
bytes as a type whose fields you don't have public access to (assuming you | ||
don't create any Invalid Values). As [the next section][] discusses, it's very | ||
important for library authors to be able to rely on privacy and ownership as a | ||
sort of program integrity proof. For instance, if you reinterpret some random | ||
non-zero bytes as a Vec, this will almost certainly result in Undefined Behavior. | ||
It's very important that you *can* just create types from a bunch of bytes if | ||
done correctly (such as pairing ptr::read with ptr::write). | ||
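One way to read "pairing ptr::read with ptr::write" is a swap: each bitwise copy made by a read is matched by a write that overwrites a slot without dropping it, so every value ends up owned exactly once. This is a sketch under that interpretation, not code from the PR; `swap_in_place` is a hypothetical name (the standard library already provides `std::mem::swap`).

```rust
use std::ptr;

/// Hypothetical example: swaps two values using raw bitwise copies.
fn swap_in_place<T>(a: &mut T, b: &mut T) {
    let a: *mut T = a;
    let b: *mut T = b;
    // SAFETY: both pointers come from distinct `&mut T`, so they are valid,
    // aligned, and do not overlap.
    unsafe {
        // Bitwise copy of `*a`; the value now temporarily exists twice.
        let tmp = ptr::read(a);
        // Overwrite `*a` with the bytes of `*b` without dropping the old `*a`.
        ptr::copy_nonoverlapping(b, a, 1);
        // Move `tmp` into `*b` without dropping the old `*b`.
        // After this, each value has exactly one owner again.
        ptr::write(b, tmp);
    }
}
```

Nothing between the read and the final write can panic, so no destructor runs while a value is duplicated.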
|
||
|
||
|
||
|
||
## Completely Safe Behavior | ||
|
||
Rust can also be quite permissive of dubious operations. | ||
Rust considers it "safe" to: | ||
|
||
* Deadlock | ||
|
@@ -78,9 +156,13 @@ Rust considers it "safe" to: | |
|
||
However any program that actually manages to do such a thing is *probably* | ||
incorrect. Rust provides lots of tools to make these things rare, but | ||
these problems are considered impractical to categorically prevent. | ||
some things are just impractical to categorically prevent. | ||
|
||
[pointer aliasing rules]: references.html | ||
[uninitialized memory]: uninitialized.html | ||
[the next section]: working-with-unsafe.html | ||
[race]: races.html | ||
[target features]: ../reference/attributes/codegen.html#the-target_feature-attribute | ||
[MaybeUninit]: ../core/mem/union.MaybeUninit.html | ||
[calling convention]: ../reference/items/external-blocks.html#abi | ||
[`ptr::offset`]: ../core/primitive.pointer.html#method.offset |
Review comment: This one sticks out. Everything else here is "check the type, then do something", this one here is not. It also introduces the notion of padding, IMO unnecessarily. And finally I think it is wrong, unless you are saying that `MaybeUninit<u8>` has a padding byte, which however would also be wrong.

Reply: For readability, I am using very ambiguous wording here that requires the "discussion below" to have a meaning rigorous enough for your standards :)
But yes this is something that needs consideration (although remember: we just need to be conservatively correct, and not precisely correct). Things being "too UB" aren't an issue unless they preclude very important cases that everyone agrees must work.

Reply: My issue with the wording here is that it is not ambiguous enough, making it just plain wrong in my eyes.
They are an issue if we tell people on reddit (as I did) that e.g.
is okay (and is on current stable the only way to construct an element of this type) -- and then they go to the nomicon and see this described as UB, and their reaction will be "well screw the nomicon, it's clearly bogus".
Put another way, if you only want to be conservatively correct, just declare every `unsafe` block as UB. ;)