Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Intolerant of truncated data, inelegant failure mode #11

Open
Synchro opened this issue Jul 30, 2018 · 7 comments
Open

Intolerant of truncated data, inelegant failure mode #11

Synchro opened this issue Jul 30, 2018 · 7 comments

Comments

@Synchro
Copy link

Synchro commented Jul 30, 2018

This is something of an edge case, but it's reasonably important if encoded strings are embedded in URLs - incomplete data is better than trashed data. If I encode a string using base62, then remove 1 or more chars from the end of the string, then decode it again, it results in complete garbage output.

This is in contrast to similar encoders that degrade more gracefully, such as PHP's built-in base64 functions. In this example I've chosen an input length that does not result in padding in the base64 encoder, as deleting padding is always harmless:

$str = 'abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ01234567';

$base62 = new Tuupola\Base62;
$e = $base62->encode($str);
$d = $base62->decode(substr($e, 0, -1));
var_dump($str, $e, $d);

$e = base64_encode($str);
$d = base64_decode(substr($e, 0, -1));
var_dump($str, $e, $d);

Output:

string(60) "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ01234567"
string(81) "4p1Nflx8o0gcd8e3xWM0Idxg3Fuxk4R0eRc9fYLj2B83X5FRXw82YB3KaoyYLuNBn1LJW2KwVqMIAhO5P"
string(60) "���`�G�`�܎���"�d�3i��3y�`��S�d6ϙ�}��6��\&}7	�l�}O�\000�!ۮG�"
string(60) "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ01234567"
string(80) "YWJjZGVmZ2hpamtsbW5vcHFyc3R1dnd4eXpBQkNERUZHSElKS0xNTk9QUVJTVFVWV1hZWjAxMjM0NTY3"
string(59) "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456"

Note that while the base62 output is unusable, the base64 output just loses a single character at the end.

I suspect this may be because the entire input string is treated as a single GMP number, rather than treating it as a byte stream, or encoding in chunks of a few chars at a time. It's noticeable when you encode nearly identical strings - in base62 they result in completely different output after the difference, rather than only differing in the areas where the input strings are different (as seen in base64). For example:

$str1 = 'abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ01234567';
$str2 = 'abcdefghijklmnopqrstuvwxyz_BCDEFGHIJKLMNOPQRSTUVWXYZ01234567';
var_dump($base62->encode($str1), $base62->encode($str2), base64_encode($str1), base64_encode($str2));

Output (^ indicates difference in output):

string(81) "4p1Nflx8o0gcd8e3xWM0Idxg3Fuxk4R0eRc9fYLj2B83X5FRXw82YB3KaoyYLuNBn1LJW2KwVqMIAhO5P"
string(81) "4p1Nflx8o0gcd8e3xWM0Idxg3Fuxk4R0eRcBcpzqd0KogZYJMSqRuFsDRn11xfrgKLFGaYUmPtLLlF5V9"
                                               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
string(80) "YWJjZGVmZ2hpamtsbW5vcHFyc3R1dnd4eXpBQkNERUZHSElKS0xNTk9QUVJTVFVWV1hZWjAxMjM0NTY3"
string(80) "YWJjZGVmZ2hpamtsbW5vcHFyc3R1dnd4eXpfQkNERUZHSElKS0xNTk9QUVJTVFVWV1hZWjAxMjM0NTY3"
                                               ^
@stof
Copy link

stof commented Apr 12, 2021

The reason why base64 only changes in one char is because 64 is a multiple of 8 and a divider of 256, so each set of 4 base64 chars covers a set of 3 bytes in the original string (and so a set of chars). When using base 62, the conversion has to propagate a remainder to the next part of the conversion. You will see similar effect for other bases.

@Synchro
Copy link
Author

Synchro commented Apr 12, 2021

Yes I'm aware of that, but it doesn't affect the robustness. I suspect the robustness problem is because GMP parses the input from right to left, so truncating a single char results in a completely different interpretation of the whole of the rest of the input.

@stof
Copy link

stof commented Apr 12, 2021

@Synchro The robustness is precisely because base64 has a local pattern of 4 chars encoding 3 bytes. The way base64 leverages that is by applying padding on the string. Encoding pads the string to a multiple of 3 bytes and decoding removes the padding (and PHP in non-strict base64 decoding can recover from missing = padding chars). But that's thanks to log2(64) = 6 and log2(256) = 8 and 6 * 4 = 8 * 3
Encoding in base 62 yields no such pattern, because a base62 byte does not even encode a round number of bits of the input. So there is no chunking in a base62 output. The output char depends on everything before it in the string (and this means that for decoding, a character depends on everything until the end)

@Synchro
Copy link
Author

Synchro commented Apr 12, 2021

Base64 decoding suffers exactly the same problem if you delete the first character of encoded text, but that's because it's not using GMP and is decoded left to right. From my first example:

echo base64_decode('YWJjZGVmZ2hpamtsbW5vcHFyc3R1dnd4eXpBQkNERUZHSElKS0xNTk9QUVJTVFVWV1hZWjAxMjM0NTY3')."\n";
echo base64_decode('WJjZGVmZ2hpamtsbW5vcHFyc3R1dnd4eXpBQkNERUZHSElKS0xNTk9QUVJTVFVWV1hZWjAxMjM0NTY3');


abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ01234567
X���Y���Z���[���\���]���^�P���Q���R���S���T���U���V�L��
M�

Similarly, we can show that dropping the first character of the base62 encoded data does not result in it corrupting the end of the string:

$base62 = new Tuupola\Base62;
$e = $base62->encode($str);
$d = $base62->decode(substr($e, 0, -1));
$f = $base62->decode(substr($e, 1));
var_dump($str, $e, $d, $f);

string(60) "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ01234567"
string(81) "4p1Nflx8o0gcd8e3xWM0Idxg3Fuxk4R0eRc9fYLj2B83X5FRXw82YB3KaoyYLuNBn1LJW2KwVqMIAhO5P"
string(60) "���`�G�`�܎���"�d�3i��3y�`��S�d6ϙ�}��6��\&}7	�l�}O�\000�!ۮG�"
string(60) "i+���p�TYZ01234567"

(My IDE suppresses some of the invalid chars in the output).

Removing a leading character in base64 acts differently because it will shift the entire encoded bit pattern and it will never resync. In base62 it eventually drifts back into sync after a certain number of characters because of the non-integral bit length; I've not worked out how long that is, but it looks like it's around 50 chars, though I guess it may also drift out of sync again later!

This encoder I wrote many years ago is significantly faster than this one. It cheats by using some padding, chunking 8 input bytes into 11 output bytes, which also makes it much more robust, but at the same time breaks compatibility with "true" base62 encoding, though that's perfectly acceptable if you're in control of both ends.

Anyway, the aim of this ticket was to raise the point that truncation (which is much more likely than losing leading chars) is much more damaging in base62, and perhaps to raise some suggestions about how to deal with that.

@stof
Copy link

stof commented Apr 13, 2021

but cheating means that it is not an actual base conversion, meaning that your encoder is not a base62 encoding anymore. It is a custom encoding scheme (that happens to use a 62-char alphabet for its output)

@Synchro
Copy link
Author

Synchro commented Apr 13, 2021

Yes, that's what I said. It is however, still very useful.

@tuupola
Copy link
Owner

tuupola commented Apr 13, 2021

Base62 encoding was not invented by me so there is nothing I can do.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants