feat: add support for base64url alphabet (#24)

* fix: decoding of padded input and add length assertions
* feat: add support for base64url encoding
* test: url safe encoding pad/no pad
* feat: add support for base64url decoding
* test: url safe decoding pad/no pad
* docs: update README and example tests for configurability
* docs: update costs for encode/decode
  - note: based on profiling, it seems that the previous costs were wrong and that the current costs have been the same since the reversed encoding/decoding was fixed in commit cc5b18a.
* chore: rename encoder/decoder config names
noir-lang · Oct 30, 2024 · dfed9dd
1 parent d51837d
Showing 4 changed files with 545 additions and 40 deletions.
diff --git a/README.md b/README.md
@@ -2,38 +2,80 @@
 
 A Base64 encoding/decoding library written in Noir which can encode arbitrary byte arrays into Base64 and decode Base64-encoded byte arrays (e.g. `"SGVsbG8gV29ybGQ=".as_bytes()`).
 
-# Usage
+## Usage
+### Configuration
+Start by selecting the encoder or decoder for your configuration. These are defined separately so that only one lookup table will be instantiated at a time, since many cases will require either an encoder or a decoder but not both.
 
-### `fn base64_encode`
-Takes an arbitrary byte array as input, unpacks it into Base64 values, then encodes each Base64 value into an ASCII character according to the [standard Base64 alphabet](https://datatracker.ietf.org/doc/html/rfc4648#section-4), to return a byte array representing the Base64 encoding. The encoded result is *not padded*, so padding must be handled separately.
+RFC 4648 specifies multiple alphabets, including the [standard Base 64 Alphabet](https://datatracker.ietf.org/doc/html/rfc4648#section-4) known as `base64` and the ["URL and Filename Safe Alphabet"](https://datatracker.ietf.org/doc/html/rfc4648#section-5) known as `base64url`. It also specifies that [padding](https://datatracker.ietf.org/doc/html/rfc4648#section-3.2) should be required in the general case but can be explicitly omitted as an option.
 
-### `fn base64_decode`
-Takes an ASCII byte array that encodes a Base64 string and decodes it into bytes. Input data is expected to be unpadded, so padding characters will cause decoding to fail.
+Available encoder configurations:
+- `BASE64_ENCODER`: uses the standard alphabet (base64) and adds padding.
+- `BASE64_NO_PAD_ENCODER`: uses the standard alphabet (base64), but omits padding.
+- `BASE64_URL_ENCODER`: uses the "URL and Filename Safe Alphabet" (base64url) and omits padding, which is common for `base64url` when the length is implicitly known, as in this case.
+- `BASE64_URL_WITH_PAD_ENCODER`: uses the "URL and Filename Safe Alphabet" (base64url) and adds padding.
 
-### `fn base64_encode_elements`
-Takes an input byte array of ASCII characters and produces an output byte array of base64-encoded characters. Data is not packed i.e. each output array element maps to a 6-bit base64 character.
+Available decoder configurations:
+- `BASE64_DECODER`: uses the standard alphabet (base64) and expects correct padding.
+- `BASE64_NO_PAD_DECODER`: uses the standard alphabet (base64), but expects all padding characters to have been stripped. A padding character encountered during decoding will trigger an error.
+- `BASE64_URL_DECODER`: uses the "URL and Filename Safe Alphabet" (base64url), but expects all padding characters to have been stripped, which is common for `base64url` when the length is implicitly known, as in this case. A padding character encountered during decoding will trigger an error.
+- `BASE64_URL_WITH_PAD_DECODER`: uses the "URL and Filename Safe Alphabet" (base64url) and expects correct padding.
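
The four alphabet/padding combinations listed above can be illustrated outside of Noir. The following is a minimal sketch using Python's standard `base64` module as a stand-in; it is not the Noir library's implementation, and the `encode`/`decode` helpers and their `url_safe`/`pad` flags are hypothetical names introduced only for this illustration.

```python
import base64


def encode(data: bytes, url_safe: bool, pad: bool) -> bytes:
    """Encode with the chosen alphabet, optionally stripping '=' padding."""
    out = base64.urlsafe_b64encode(data) if url_safe else base64.b64encode(data)
    return out if pad else out.rstrip(b"=")


def decode(text: bytes, url_safe: bool, pad: bool) -> bytes:
    """Decode with the chosen alphabet and enforce the chosen padding rule."""
    if pad:
        # Padded input always has a length that is a multiple of 4.
        if len(text) % 4 != 0:
            raise ValueError("padded input length must be a multiple of 4")
    else:
        # Mirrors the README rule: a padding character is an error here.
        if b"=" in text:
            raise ValueError("padding character in unpadded input")
        # Python's decoder needs padding, so restore it internally.
        text = text + b"=" * (-len(text) % 4)
    altchars = b"-_" if url_safe else None
    return base64.b64decode(text, altchars=altchars, validate=True)


# Standard alphabet, with and without padding.
print(encode(b"Hello World", url_safe=False, pad=True))   # b'SGVsbG8gV29ybGQ='
print(encode(b"Hello World", url_safe=False, pad=False))  # b'SGVsbG8gV29ybGQ'

# The two alphabets differ only in the last two symbols ('+/' vs '-_').
print(encode(b"\xfb\xff", url_safe=False, pad=True))  # b'+/8='
print(encode(b"\xfb\xff", url_safe=True, pad=False))  # b'-_8'
```

Feeding a padded string to the unpadded variant, for example `decode(b"SGVsbG8gV29ybGQ=", url_safe=False, pad=False)`, raises `ValueError` in this sketch, which corresponds to the "will trigger an error" behavior described for the no-pad decoders.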
 
-### `fn base64_decode_elements`
-Takes an input byte array of base64 characters and produces an output byte array of ASCII characters. Input data is not packed i.e. each input element maps to a 6-bit base64 character. Input data is expected not to contain padding characters. Padding characters will cause decoding to fail.
+### `fn encode`
+Takes an arbitrary byte array as input, encodes it in Base64 according to the alphabet and padding rules specified by the configuration, then encodes each Base64 character into UTF-8 to return a byte array representing the Base64 encoding.
 
-### Example usage
+```
+// bytes: [u8; N]
+let base64 = BASE64_ENCODER.encode(bytes);
+```
+
+### `fn decode`
+Takes a UTF-8 byte array that encodes a Base64 string and attempts to decode it into bytes according to the provided configuration specifying the alphabet and padding rules.
+
+```
+// base64: [u8; N]
+let bytes = BASE64_DECODER.decode(base64);
+```
+
+## Example usage
 (see tests in `lib.nr` for more examples)
 
 ```
-use dep::noir_base64;
 fn encode_and_decode() {
     let input: str<88> = "The quick brown fox jumps over the lazy dog, while 42 ravens perch atop a rusty mailbox.";
-    let base64_encoded: str<118> = "VGhlIHF1aWNrIGJyb3duIGZveCBqdW1wcyBvdmVyIHRoZSBsYXp5IGRvZywgd2hpbGUgNDIgcmF2ZW5zIHBlcmNoIGF0b3AgYSBydXN0eSBtYWlsYm94Lg";
+    let base64_encoded = "VGhlIHF1aWNrIGJyb3duIGZveCBqdW1wcyBvdmVyIHRoZSBsYXp5IGRvZywgd2hpbGUgNDIgcmF2ZW5zIHBlcmNoIGF0b3AgYSBydXN0eSBtYWlsYm94Lg==";
 
-    let encoded:[u8; 118] = noir_base64::base64_encode(input.as_bytes());
+    let encoded:[u8; 120] = noir_base64::BASE64_ENCODER.encode(input.as_bytes());
     assert(encoded == base64_encoded.as_bytes());
 
-    let decoded: [u8; 88] = noir_base64::base64_decode(encoded);
+    let decoded: [u8; 88] = noir_base64::BASE64_DECODER.decode(encoded);
     assert(decoded == input.as_bytes());
 }
 ```
 
-# Costs
 
-- `base64_encode` will encode an array of 88 bytes in ~1182 gates, plus a ~64 gate cost to initialize the encoding lookup table (the initialization cost is incurred once regardless of the number of encodings).
-- `base64_decode` will decode an array of 118 bytes in ~2150 gates, plus a ~256 gate cost to initialize the decoding lookup table (the initialization cost is incurred once regardless of the number of decodings).
+## Costs
+
+All of the benchmarks below are for the [Barretenberg proving backend](https://github.com/AztecProtocol/aztec-packages/tree/master/barretenberg). 
+
+After the initial setup cost it is often cheaper to decode than to encode, as shown by the numbers below where the encode/decode were run over the same pairs of unencoded and base64-encoded text.
+
+| UTF-8 Length | Base64 Length | # times | # Gates to Encode | # Gates to Decode |
+| ------------ | ------------- | ------- | ----------------- | ----------------- |
+| 12           | 16            | 1       | 2946              | 1065              |
+| 12           | 16            | 2       | 3057              | 1114              |
+| 12           | 16            | 3       | 3166              | 1163              |
+| 610          | 816           | 1       | 7349              | 8062              |
+| 610          | 816           | 2       | 10993             | 9181              |
+| 610          | 816           | 3       | 14597             | 10239             |
+
+### `encode`
+Costs are equivalent for all encoder configurations. 
+
+- encoding an array of 12 bytes into 16 base64 characters requires ~110 gates plus an initial setup cost of ~2836 gates. (Gate counts for encoding the same array 1, 2, and 3 times were 2946, 3057, and 3166 respectively.)
+- encoding an array of 610 input bytes requires ~3625 gates plus an initial setup cost of ~3700 gates. (Gate counts for encoding the same array 1, 2, 3, and 4 times were 7349, 10993, 14597, and 18200 respectively.)
+
+### `decode`
+Decoding padded inputs costs 1-2 gates more than decoding unpadded inputs. Since the difference is marginal, the numbers below are only for the padded case.
+
+- decoding an array of 16 base64 characters into 12 bytes requires ~49 gates plus an initial setup cost of ~1016 gates. (Gate counts for decoding the same array 1, 2, and 3 times were 1065, 1114, and 1163 respectively.)
+- decoding an array of 816 base64 characters (including padding) into 610 input bytes requires ~1060 gates plus an initial setup cost of ~7000 gates. (Gate counts for decoding the same array 1, 2, 3, and 4 times were 8062, 9181, 10239, and 11298 respectively.)
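
The "per-call cost plus setup cost" figures quoted above can be recovered from the raw gate counts. The sketch below is one plausible way to do that arithmetic, assuming the per-call cost is the average difference between consecutive runs and the setup cost is whatever remains of the first run; `cost_model` is a hypothetical helper name, and the README's rounded figures for the larger inputs may have been derived slightly differently.

```python
def cost_model(counts: list[int]) -> tuple[float, float]:
    """Split total gate counts into (setup, per_call).

    counts[i] is the total gate count when the operation runs i + 1 times.
    """
    # Marginal cost of each additional call.
    deltas = [later - earlier for earlier, later in zip(counts, counts[1:])]
    per_call = sum(deltas) / len(deltas)
    # Whatever the first call costs beyond one marginal call is setup.
    setup = counts[0] - per_call
    return setup, per_call


# Gate counts copied from the README text above.
print(cost_model([2946, 3057, 3166]))           # encode, 12 bytes
print(cost_model([1065, 1114, 1163]))           # decode, 16 characters
print(cost_model([7349, 10993, 14597, 18200]))  # encode, 610 bytes
print(cost_model([8062, 9181, 10239, 11298]))   # decode, 816 characters
```

For the 12-byte case this reproduces the stated ~2836 + ~110 gates for encoding and ~1016 + ~49 gates for decoding exactly; for the 610-byte case it lands close to the rounded values in the text.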