Intolerant of truncated data, inelegant failure mode #11

Synchro · 2018-07-30T11:51:43Z

This is something of an edge case, but it's reasonably important if encoded strings are embedded in URLs - incomplete data is better than trashed data. If I encode a string using base62, then remove 1 or more chars from the end of the string, then decode it again, it results in complete garbage output.

This is in contrast to similar encoders that degrade more gracefully, such as PHP's built-in base64 functions. In this example I've chosen an input length that does not result in padding in the base64 encoder, as deleting padding is always harmless:

$str = 'abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ01234567';

$base62 = new Tuupola\Base62;
$e = $base62->encode($str);
$d = $base62->decode(substr($e, 0, -1));
var_dump($str, $e, $d);

$e = base64_encode($str);
$d = base64_decode(substr($e, 0, -1));
var_dump($str, $e, $d);

Output:

string(60) "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ01234567"
string(81) "4p1Nflx8o0gcd8e3xWM0Idxg3Fuxk4R0eRc9fYLj2B83X5FRXw82YB3KaoyYLuNBn1LJW2KwVqMIAhO5P"
string(60) "���`�G�`�܎���"�d�3i��3y�`��S�d6ϙ�}��6��\&}7	�l�}O�\000�!ۮG�"
string(60) "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ01234567"
string(80) "YWJjZGVmZ2hpamtsbW5vcHFyc3R1dnd4eXpBQkNERUZHSElKS0xNTk9QUVJTVFVWV1hZWjAxMjM0NTY3"
string(59) "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456"

Note that while the base62 output is unusable, the base64 output just loses a single character at the end.

I suspect this may be because the entire input string is treated as a single GMP number, rather than treating it as a byte stream, or encoding in chunks of a few chars at a time. It's noticeable when you encode nearly identical strings - in base62 they result in completely different output after the difference, rather than only differing in the areas where the input strings are different (as seen in base64). For example:

$str1 = 'abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ01234567';
$str2 = 'abcdefghijklmnopqrstuvwxyz_BCDEFGHIJKLMNOPQRSTUVWXYZ01234567';
var_dump($base62->encode($str1), $base62->encode($str2), base64_encode($str1), base64_encode($str2));

Output (^ indicates difference in output):

string(81) "4p1Nflx8o0gcd8e3xWM0Idxg3Fuxk4R0eRc9fYLj2B83X5FRXw82YB3KaoyYLuNBn1LJW2KwVqMIAhO5P"
string(81) "4p1Nflx8o0gcd8e3xWM0Idxg3Fuxk4R0eRcBcpzqd0KogZYJMSqRuFsDRn11xfrgKLFGaYUmPtLLlF5V9"
                                               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
string(80) "YWJjZGVmZ2hpamtsbW5vcHFyc3R1dnd4eXpBQkNERUZHSElKS0xNTk9QUVJTVFVWV1hZWjAxMjM0NTY3"
string(80) "YWJjZGVmZ2hpamtsbW5vcHFyc3R1dnd4eXpfQkNERUZHSElKS0xNTk9QUVJTVFVWV1hZWjAxMjM0NTY3"
                                               ^

The text was updated successfully, but these errors were encountered:

stof · 2021-04-12T16:49:51Z

The reason why base64 only changes in one char is because 64 is a multiple of 8 and a divider of 256, so each set of 4 base64 chars covers a set of 3 bytes in the original string (and so a set of chars). When using base 62, the conversion has to propagate a remainder to the next part of the conversion. You will see similar effect for other bases.

Synchro · 2021-04-12T17:25:34Z

Yes I'm aware of that, but it doesn't affect the robustness. I suspect the robustness problem is because GMP parses the input from right to left, so truncating a single char results in a completely different interpretation of the whole of the rest of the input.

stof · 2021-04-12T18:10:30Z

@Synchro The robustness is precisely because base64 has a local pattern of 4 chars encoding 3 bytes. The way base64 leverages that is by applying padding on the string. Encoding pads the string to a multiple of 3 bytes and decoding removes the padding (and PHP in non-strict base64 decoding can recover from missing = padding chars). But that's thanks to log2(64) = 6 and log2(256) = 8 and 6 * 4 = 8 * 3
Encoding in base 62 yields no such pattern, because a base62 byte does not even encode a round number of bits of the input. So there is no chunking in a base62 output. The output char depends on everything before it in the string (and this means that for decoding, a character depends on everything until the end)

Synchro · 2021-04-12T23:45:23Z

Base64 decoding suffers exactly the same problem if you delete the first character of encoded text, but that's because it's not using GMP and is decoded left to right. From my first example:

echo base64_decode('YWJjZGVmZ2hpamtsbW5vcHFyc3R1dnd4eXpBQkNERUZHSElKS0xNTk9QUVJTVFVWV1hZWjAxMjM0NTY3')."\n";
echo base64_decode('WJjZGVmZ2hpamtsbW5vcHFyc3R1dnd4eXpBQkNERUZHSElKS0xNTk9QUVJTVFVWV1hZWjAxMjM0NTY3');


abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ01234567
X���Y���Z���[���\���]���^�P���Q���R���S���T���U���V�L��
M�

Similarly, we can show that dropping the first character of the base62 encoded data does not result in it corrupting the end of the string:

$base62 = new Tuupola\Base62;
$e = $base62->encode($str);
$d = $base62->decode(substr($e, 0, -1));
$f = $base62->decode(substr($e, 1));
var_dump($str, $e, $d, $f);

string(60) "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ01234567"
string(81) "4p1Nflx8o0gcd8e3xWM0Idxg3Fuxk4R0eRc9fYLj2B83X5FRXw82YB3KaoyYLuNBn1LJW2KwVqMIAhO5P"
string(60) "���`�G�`�܎���"�d�3i��3y�`��S�d6ϙ�}��6��\&}7	�l�}O�\000�!ۮG�"
string(60) "i+���p�TYZ01234567"

(My IDE suppresses some of the invalid chars in the output).

Removing a leading character in base64 acts differently because it will shift the entire encoded bit pattern and it will never resync. In base62 it eventually drifts back into sync after a certain number of characters because of the non-integral bit length; I've not worked out how long that is, but it looks like it's around 50 chars, though I guess it may also drift out of sync again later!

This encoder I wrote many years ago is significantly faster than this one. It cheats by using some padding, chunking 8 input bytes into 11 output bytes, which also makes it much more robust, but at the same time breaks compatibility with "true" base62 encoding, though that's perfectly acceptable if you're in control of both ends.

Anyway, the aim of this ticket was to raise the point that truncation (which is much more likely than losing leading chars) is much more damaging in base62, and perhaps to raise some suggestions about how to deal with that.

stof · 2021-04-13T08:18:59Z

but cheating means that it is not an actual base conversion, meaning that your encoder is not a base62 encoding anymore. It is a custom encoding scheme (that happens to use a 62-char alphabet for its output)

Synchro · 2021-04-13T08:20:34Z

Yes, that's what I said. It is however, still very useful.

tuupola · 2021-04-13T10:02:27Z

Base62 encoding was not invented by me so there is nothing I can do.

Synchro mentioned this issue Mar 19, 2019

Intolerant of truncated data, inelegant failure mode xobotyi/basen#3

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Intolerant of truncated data, inelegant failure mode #11

Intolerant of truncated data, inelegant failure mode #11

Synchro commented Jul 30, 2018 •

edited

Loading

stof commented Apr 12, 2021 •

edited

Loading

Synchro commented Apr 12, 2021

stof commented Apr 12, 2021 •

edited

Loading

Synchro commented Apr 12, 2021

stof commented Apr 13, 2021

Synchro commented Apr 13, 2021

tuupola commented Apr 13, 2021

Intolerant of truncated data, inelegant failure mode #11

Intolerant of truncated data, inelegant failure mode #11

Comments

Synchro commented Jul 30, 2018 • edited Loading

stof commented Apr 12, 2021 • edited Loading

Synchro commented Apr 12, 2021

stof commented Apr 12, 2021 • edited Loading

Synchro commented Apr 12, 2021

stof commented Apr 13, 2021

Synchro commented Apr 13, 2021

tuupola commented Apr 13, 2021

Synchro commented Jul 30, 2018 •

edited

Loading

stof commented Apr 12, 2021 •

edited

Loading

stof commented Apr 12, 2021 •

edited

Loading