Commit

Merge pull request #97 from ParkMyCar/refactor/rename-to-compactstring
refactor: Rename CompactStr to CompactString
ParkMyCar authored May 27, 2022
2 parents 57f70bd + fe26a72 commit e51bb8e
Showing 13 changed files with 556 additions and 537 deletions.
38 changes: 19 additions & 19 deletions README.md
@@ -32,13 +32,13 @@
<br />

### About
A `CompactStr` is a more memory efficient string type, that can store smaller strings on the stack, and transparently stores longer strings on the heap (aka a small string optimization).
A `CompactString` is a more memory efficient string type, that can store smaller strings on the stack, and transparently stores longer strings on the heap (aka a small string optimization).
They can mostly be used as a drop in replacement for `String` and are particularly useful in parsing, deserializing, or any other application where you may
have smaller strings.

### Properties
A `CompactStr` specifically has the following properties:
* `size_of::<CompactStr>() == size_of::<String>()`
A `CompactString` specifically has the following properties:
* `size_of::<CompactString>() == size_of::<String>()`
* Stores up to 24 bytes on the stack
* 12 bytes if running on a 32 bit architecture
* Strings longer than 24 bytes are stored on the heap
@@ -49,8 +49,8 @@ A `CompactStr` specifically has the following properties:

### Features
`compact_str` has the following features:
1. `serde`, which implements [`Deserialize`](https://docs.rs/serde/latest/serde/trait.Deserialize.html) and [`Serialize`](https://docs.rs/serde/latest/serde/trait.Serialize.html) from the popular [`serde`](https://docs.rs/serde/latest/serde/) crate, for `CompactStr`.
2. `bytes`, which provides two methods `from_utf8_buf<B: Buf>(buf: &mut B)` and `from_utf8_buf_unchecked<B: Buf>(buf: &mut B)`, which allows for the creation of a `CompactStr` from a [`bytes::Buf`](https://docs.rs/bytes/latest/bytes/trait.Buf.html)
1. `serde`, which implements [`Deserialize`](https://docs.rs/serde/latest/serde/trait.Deserialize.html) and [`Serialize`](https://docs.rs/serde/latest/serde/trait.Serialize.html) from the popular [`serde`](https://docs.rs/serde/latest/serde/) crate, for `CompactString`.
2. `bytes`, which provides two methods `from_utf8_buf<B: Buf>(buf: &mut B)` and `from_utf8_buf_unchecked<B: Buf>(buf: &mut B)`, which allows for the creation of a `CompactString` from a [`bytes::Buf`](https://docs.rs/bytes/latest/bytes/trait.Buf.html)

### How it works
Note: this explanation assumes a 64-bit architecture; for 32-bit architectures, generally divide any number by 2.
@@ -65,27 +65,27 @@ e.g. its layout is something like the following:

This results in 24 bytes being stored on the stack, 8 bytes for each field. Then the actual string is stored on the heap, usually with additional memory allocated to prevent re-allocating if the string is mutated.
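
As a quick sanity check of the "three fields, one word each" claim, the sketch below measures `String` with the standard library only. The helper name `string_stack_words` is hypothetical, and the exact layout of `String` is an implementation detail of the standard library rather than a language guarantee, so treat this as an observation about current toolchains.

```rust
use std::mem::size_of;

/// Number of pointer-sized words a `String` occupies on the stack.
/// (Hypothetical helper, for illustration only.)
fn string_stack_words() -> usize {
    size_of::<String>() / size_of::<usize>()
}
```

On a 64-bit target a word is 8 bytes, which gives the 24 bytes mentioned above; on a 32-bit target the same three words take 12 bytes.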

The idea of `CompactStr` is instead of storing metadata on the stack, just store the string itself. This way for smaller strings we save a bit of memory, and we
don't have to heap allocate so it's more performant. A `CompactStr` is limited to 24 bytes (aka `size_of::<String>()`) so it won't ever use more memory than a
The idea of `CompactString` is instead of storing metadata on the stack, just store the string itself. This way for smaller strings we save a bit of memory, and we
don't have to heap allocate so it's more performant. A `CompactString` is limited to 24 bytes (aka `size_of::<String>()`) so it won't ever use more memory than a
`String` would.

The memory layout of a `CompactStr` looks something like:
The memory layout of a `CompactString` looks something like:

`CompactStr: [ buffer<23> | len<1> ]`
`CompactString: [ buffer<23> | len<1> ]`

#### Memory Layout
Internally a `CompactStr` has two variants:
Internally a `CompactString` has two variants:
1. **Inline**, a string <= 24 bytes long
2. **Heap** allocated, a string > 24 bytes long

To maximize memory usage, we use a [`union`](https://doc.rust-lang.org/reference/items/unions.html) instead of an `enum`. In Rust an `enum` requires at least 1 byte
for the discriminant (tracking what variant we are), instead we use a `union` which allows us to manually define the discriminant. `CompactStr` defines the
for the discriminant (tracking what variant we are), instead we use a `union` which allows us to manually define the discriminant. `CompactString` defines the
discriminant *within* the last byte, using any extra bits for metadata. Specifically the discriminant has two variants:

1. `0b11111111` - All 1s, indicates **heap** allocated
2. `0b11XXXXXX` - Two leading 1s, indicates **inline**, with the trailing 6 bits used to store the length

and specifically the overall memory layout of a `CompactStr` is:
and specifically the overall memory layout of a `CompactString` is:

1. `heap: { ptr: NonNull<u8>, len: usize, cap: Capacity }`
2. `inline: { buffer: [u8; 24] }`
@@ -94,29 +94,29 @@ and specifically the overall memory layout of a `CompactStr` is:

For **heap** allocated strings we use a custom `BoxString` which normally stores the capacity of the string on the stack, but also optionally allows us to store it on the heap. Since we use the last byte to track our discriminant, we only have 7 bytes to store the capacity, or 3 bytes on a 32-bit architecture. 7 bytes allows us to store a value up to `2^56`, aka 64 petabytes, while 3 bytes only allows us to store a value up to `2^24`, aka 16 megabytes.

For 64-bit architectures we always inline the capacity, because we can safely assume our strings will never be larger than 64 petabytes, but on 32-bit architectures, when creating or growing a `CompactStr`, if the text is larger than 16MB then we move the capacity onto the heap.
For 64-bit architectures we always inline the capacity, because we can safely assume our strings will never be larger than 64 petabytes, but on 32-bit architectures, when creating or growing a `CompactString`, if the text is larger than 16MB then we move the capacity onto the heap.

We handle the capacity in this way for two reasons:
1. Users shouldn't have to pay for what they don't use. Meaning, in the _majority_ of cases the capacity of the buffer could easily fit into 7 or 3 bytes, so the user shouldn't have to pay the memory cost of storing the capacity on the heap, if they don't need to.
2. Allows us to convert `From<String>` in `O(1)` time, by taking the parts of a `String` (e.g. `ptr`, `len`, and `cap`) and using those to create a `CompactStr`, without having to do any heap allocations. This is important when using `CompactStr` in large codebases where you might have `CompactStr` working alongside of `String`.
2. Allows us to convert `From<String>` in `O(1)` time, by taking the parts of a `String` (e.g. `ptr`, `len`, and `cap`) and using those to create a `CompactString`, without having to do any heap allocations. This is important when using `CompactString` in large codebases where you might have `CompactString` working alongside of `String`.
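
To make the "capacity shares a word with the discriminant" idea concrete, here is a minimal sketch. It is not the crate's actual implementation: `pack_capacity`, `unpack_capacity`, and `MAX_INLINE_CAPACITY` are hypothetical names, and storing the capacity as little-endian bytes is an assumption made for illustration. Returning `None` stands in for the case where the capacity would have to move onto the heap.

```rust
use std::mem::size_of;

/// Marker stored in the last byte for heap-allocated strings.
const HEAP_MARKER: u8 = 0b1111_1111;

/// Largest capacity that fits in the bytes left after the marker byte
/// (7 bytes on 64-bit targets, 3 bytes on 32-bit targets).
const MAX_INLINE_CAPACITY: usize = (1usize << ((size_of::<usize>() - 1) * 8)) - 1;

/// Packs `cap` into one word whose last byte is the heap marker.
/// Returns `None` when the capacity needs the full word, i.e. when it
/// would have to be stored somewhere else (hypothetical helper).
fn pack_capacity(cap: usize) -> Option<[u8; size_of::<usize>()]> {
    if cap > MAX_INLINE_CAPACITY {
        return None;
    }
    // Little-endian puts the unused high byte last, where the marker goes.
    let mut bytes = cap.to_le_bytes();
    bytes[size_of::<usize>() - 1] = HEAP_MARKER;
    Some(bytes)
}

/// Recovers the capacity by clearing the marker byte again.
fn unpack_capacity(mut bytes: [u8; size_of::<usize>()]) -> usize {
    bytes[size_of::<usize>() - 1] = 0;
    usize::from_le_bytes(bytes)
}
```

For example, `pack_capacity(100)` round-trips through `unpack_capacity`, while `pack_capacity(usize::MAX)` yields `None` because the value needs every byte of the word.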

For **inline** strings we only have a 24 byte buffer on the stack. This might make you wonder how can we store a 24 byte long string, inline? Don't we also need to store the length somewhere?

To do this, we utilize the fact that the last byte of our string could only ever have a value in the range `[0, 192)`. We know this because all strings in Rust are valid [UTF-8](https://en.wikipedia.org/wiki/UTF-8), and the only valid byte pattern for the last byte of a UTF-8 character (and thus the possible last byte of a string) is `0b0XXXXXXX` aka `[0, 128)` or `0b10XXXXXX` aka `[128, 192)`. This leaves all values in `[192, 255]` as unused in our last byte. Therefore, we can use values in the range of `[192, 215]` to represent a length in the range of `[0, 23]`, and if our last byte has a value `< 192`, we know that's a UTF-8 character, and can interpret the length of our string as `24`.

Specifically, the last byte on the stack for a `CompactStr` has the following uses:
* `[0, 192)` - Is the last byte of a UTF-8 char, the `CompactStr` is stored on the stack and implicitly has a length of `24`
* `[192, 215]` - Denotes a length in the range of `[0, 23]`, this `CompactStr` is stored on the stack.
Specifically, the last byte on the stack for a `CompactString` has the following uses:
* `[0, 192)` - Is the last byte of a UTF-8 char, the `CompactString` is stored on the stack and implicitly has a length of `24`
* `[192, 215]` - Denotes a length in the range of `[0, 23]`, this `CompactString` is stored on the stack.
* `[216, 255)` - Unused
* `255` - Denotes this `CompactStr` is stored on the heap
* `255` - Denotes this `CompactString` is stored on the heap
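
The ranges above can be sketched as a single `match` on the last byte. This is an illustration under stated assumptions rather than the crate's code: `LastByte` and `classify_last_byte` are hypothetical names, and the unused range is taken to be 216 through 254, since 215 already encodes a length of 23.

```rust
/// What the final byte of the 24-byte stack representation tells us.
/// (Hypothetical type, for illustration only.)
#[derive(Debug, PartialEq, Eq)]
enum LastByte {
    /// The byte is the last byte of a UTF-8 char: inline, length 24.
    InlineFull,
    /// Inline string with an explicit length in `0..=23`.
    InlineLen(u8),
    /// Values not assigned any meaning.
    Unused,
    /// The string is stored on the heap.
    Heap,
}

/// Interprets the last byte according to the ranges listed above.
fn classify_last_byte(byte: u8) -> LastByte {
    match byte {
        // 0b0XXXXXXX or 0b10XXXXXX: a valid final byte of a UTF-8 char.
        0..=191 => LastByte::InlineFull,
        // 0b11XXXXXX with the low bits holding the length.
        192..=215 => LastByte::InlineLen(byte - 192),
        216..=254 => LastByte::Unused,
        // All 1s marks a heap allocation.
        255 => LastByte::Heap,
    }
}
```

Because the four ranges cover every `u8` value exactly once, the compiler accepts the `match` without a catch-all arm, which is a cheap way to confirm that the encoding has no gaps or overlaps.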

### Testing
Strings and Unicode can be quite messy, and on top of that we're working with things at the bit level. `compact_str` has an _extensive_ test suite comprised of unit testing, property testing, and fuzz testing, to ensure our invariants are upheld. We test across all major OSes (Windows, macOS, and Linux), architectures (64-bit and 32-bit), and endian-ness (big endian and little endian).

Fuzz testing is run with `libFuzzer` _and_ `AFL++` with `AFL++` running on both `x86_64` and `ARMv7` architectures. We test with [`miri`](https://github.com/rust-lang/miri) to catch cases of undefined behavior, and run all tests on every rust compiler since `v1.49` to ensure support for our minimum supported Rust version (MSRV).

### `unsafe` code
`CompactStr` uses a bit of unsafe code because accessing fields from a `union` is inherently unsafe, the compiler can't guarantee what value is actually stored.
`CompactString` uses a bit of unsafe code because accessing fields from a `union` is inherently unsafe, the compiler can't guarantee what value is actually stored.
We also have some manually implemented heap data structures, i.e. `BoxString`, and mess with bytes at a bit level.
That being said, uses of unsafe code in this library are quite limited and constrained to only where absolutely necessary, and always documented with
`// SAFETY: <reason>`.
95 changes: 48 additions & 47 deletions bench/benches/apis.rs
@@ -1,8 +1,8 @@
//! Benchmarks for various APIs to make sure `CompactStr` is at least no slower than `String`
//! Benchmarks for various APIs to make sure `CompactString` is at least no slower than `String`
use std::time::Instant;

use compact_str::CompactStr;
use compact_str::CompactString;
use criterion::{
black_box,
criterion_group,
@@ -12,9 +12,9 @@ use criterion::{

static VERY_LONG_STR: &str = include_str!("../data/moby10b.txt");

fn compact_str_inline_length(c: &mut Criterion) {
fn compact_string_inline_length(c: &mut Criterion) {
let word = "i am short";
let compact_str = CompactStr::new(word);
let compact_str = CompactString::new(word);
c.bench_function("inline length", |b| {
b.iter(|| {
let len = black_box(compact_str.len());
@@ -23,9 +23,9 @@ fn compact_str_inline_length(c: &mut Criterion) {
});
}

fn compact_str_heap_length(c: &mut Criterion) {
fn compact_string_heap_length(c: &mut Criterion) {
let word = "I am a very long string that will get allocated on the heap";
let compact_str = CompactStr::new(word);
let compact_str = CompactString::new(word);
c.bench_function("heap length", |b| {
b.iter(|| {
let len = black_box(compact_str.len());
@@ -34,8 +34,8 @@ fn compact_str_heap_length(c: &mut Criterion) {
});
}

fn compact_str_very_big_heap_length(c: &mut Criterion) {
let compact_str = CompactStr::new(VERY_LONG_STR);
fn compact_string_very_big_heap_length(c: &mut Criterion) {
let compact_str = CompactString::new(VERY_LONG_STR);
c.bench_function("very long heap length", |b| {
b.iter(|| {
let len = black_box(compact_str.len());
@@ -44,31 +44,31 @@ fn compact_str_very_big_heap_length(c: &mut Criterion) {
});
}

fn compact_str_reserve_small(c: &mut Criterion) {
fn compact_string_reserve_small(c: &mut Criterion) {
c.bench_function("reserve small", |b| {
b.iter(|| {
let mut compact_str = CompactStr::default();
let mut compact_str = CompactString::default();
black_box(compact_str.reserve(10));
})
});
}

fn compact_str_reserve_large(c: &mut Criterion) {
fn compact_string_reserve_large(c: &mut Criterion) {
c.bench_function("reserve large", |b| {
b.iter(|| {
let mut compact_str = CompactStr::default();
let mut compact_str = CompactString::default();
black_box(compact_str.reserve(100));
})
});
}

fn compact_str_clone_small(c: &mut Criterion) {
let compact = CompactStr::new("i am short");
fn compact_string_clone_small(c: &mut Criterion) {
let compact = CompactString::new("i am short");
c.bench_function("clone small", |b| b.iter(|| compact.clone()));
}

fn compact_str_clone_large_and_modify(c: &mut Criterion) {
let compact = CompactStr::new("I am a very long string that will get allocated on the heap");
fn compact_string_clone_large_and_modify(c: &mut Criterion) {
let compact = CompactString::new("I am a very long string that will get allocated on the heap");
c.bench_function("clone large", |b| {
b.iter(|| {
let mut clone = compact.clone();
@@ -79,53 +79,54 @@ fn compact_str_clone_large_and_modify(c: &mut Criterion) {
});
}

fn compact_str_extend_chars_empty(c: &mut Criterion) {
fn compact_string_extend_chars_empty(c: &mut Criterion) {
c.bench_function("extend chars empty", |b| {
b.iter(|| {
let mut compact =
CompactStr::new("I am a very long string that will get allocated on the heap");
CompactString::new("I am a very long string that will get allocated on the heap");
compact.extend("".chars());
})
});
}

fn compact_str_extend_chars_short(c: &mut Criterion) {
fn compact_string_extend_chars_short(c: &mut Criterion) {
c.bench_function("extend chars short", |b| {
b.iter(|| {
let mut compact = CompactStr::new("hello");
let mut compact = CompactString::new("hello");
compact.extend((0..10).map(|_| '!'));
})
});
}

fn compact_str_extend_chars_inline_to_heap_20(c: &mut Criterion) {
fn compact_string_extend_chars_inline_to_heap_20(c: &mut Criterion) {
c.bench_function("extend chars inline to heap, 20", |b| {
b.iter(|| {
let mut compact = CompactStr::new("hello world");
let mut compact = CompactString::new("hello world");
compact.extend((0..20).map(|_| '!'));
})
});
}

fn compact_str_extend_chars_heap_20(c: &mut Criterion) {
fn compact_string_extend_chars_heap_20(c: &mut Criterion) {
c.bench_function("extend chars heap, 20", |b| {
b.iter(|| {
let mut compact = CompactStr::new("this is a long string that will start on the heap");
let mut compact =
CompactString::new("this is a long string that will start on the heap");
compact.extend((0..20).map(|_| '!'));
})
});
}

fn compact_str_from_string_inline(c: &mut Criterion) {
fn compact_string_from_string_inline(c: &mut Criterion) {
c.bench_function("compact_str_from_string_inline", |b| {
b.iter_custom(|iters| {
let mut durations = vec![];
for _ in 0..iters {
let word = String::from("I am short");

// only time how long it takes to go from String -> CompactStr
// only time how long it takes to go from String -> CompactString
let start = Instant::now();
let c = CompactStr::from(word);
let c = CompactString::from(word);
let duration = start.elapsed();

// explicitly drop _after_ we've finished timing
@@ -138,16 +139,16 @@ fn compact_str_from_string_inline(c: &mut Criterion) {
});
}

fn compact_str_from_string_heap(c: &mut Criterion) {
fn compact_string_from_string_heap(c: &mut Criterion) {
c.bench_function("compact_str_from_string_heap", |b| {
b.iter_custom(|iters| {
let mut durations = vec![];
for _ in 0..iters {
let word = String::from("I am a long string, look at me!");

// only time how long it takes to go from String -> CompactStr
// only time how long it takes to go from String -> CompactString
let start = Instant::now();
let c = CompactStr::from(word);
let c = CompactString::from(word);
let duration = start.elapsed();

// explicitly drop _after_ we've finished timing
@@ -160,16 +161,16 @@ fn compact_str_from_string_heap(c: &mut Criterion) {
});
}

fn compact_str_from_string_heap_long(c: &mut Criterion) {
fn compact_string_from_string_heap_long(c: &mut Criterion) {
c.bench_function("compact_str_from_string_heap_long", |b| {
b.iter_custom(|iters| {
let mut durations = vec![];
for _ in 0..iters {
let word = String::from(VERY_LONG_STR);

// only time how long it takes to go from String -> CompactStr
// only time how long it takes to go from String -> CompactString
let start = Instant::now();
let c = CompactStr::from(word);
let c = CompactString::from(word);
let duration = start.elapsed();

// explicitly drop _after_ we've finished timing
@@ -278,20 +279,20 @@ fn std_str_str_extend_chars_20(c: &mut Criterion) {

criterion_group!(
compact_str,
compact_str_inline_length,
compact_str_heap_length,
compact_str_very_big_heap_length,
compact_str_reserve_small,
compact_str_reserve_large,
compact_str_clone_small,
compact_str_clone_large_and_modify,
compact_str_extend_chars_empty,
compact_str_extend_chars_short,
compact_str_extend_chars_inline_to_heap_20,
compact_str_extend_chars_heap_20,
compact_str_from_string_inline,
compact_str_from_string_heap,
compact_str_from_string_heap_long
compact_string_inline_length,
compact_string_heap_length,
compact_string_very_big_heap_length,
compact_string_reserve_small,
compact_string_reserve_large,
compact_string_clone_small,
compact_string_clone_large_and_modify,
compact_string_extend_chars_empty,
compact_string_extend_chars_short,
compact_string_extend_chars_inline_to_heap_20,
compact_string_extend_chars_heap_20,
compact_string_from_string_inline,
compact_string_from_string_heap,
compact_string_from_string_heap_long
);
criterion_group!(
std_string,