-
Notifications
You must be signed in to change notification settings - Fork 18
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Intolerant of truncated data, inelegant failure mode #11
Comments
The reason why base64 only changes in one char is because 64 is a multiple of 8 and a divider of 256, so each set of 4 base64 chars covers a set of 3 bytes in the original string (and so a set of chars). When using base 62, the conversion has to propagate a remainder to the next part of the conversion. You will see similar effect for other bases. |
Yes I'm aware of that, but it doesn't affect the robustness. I suspect the robustness problem is because GMP parses the input from right to left, so truncating a single char results in a completely different interpretation of the whole of the rest of the input. |
@Synchro The robustness is precisely because base64 has a local pattern of 4 chars encoding 3 bytes. The way base64 leverages that is by applying padding on the string. Encoding pads the string to a multiple of 3 bytes and decoding removes the padding (and PHP in non-strict base64 decoding can recover from missing |
Base64 decoding suffers exactly the same problem if you delete the first character of encoded text, but that's because it's not using GMP and is decoded left to right. From my first example: echo base64_decode('YWJjZGVmZ2hpamtsbW5vcHFyc3R1dnd4eXpBQkNERUZHSElKS0xNTk9QUVJTVFVWV1hZWjAxMjM0NTY3')."\n";
echo base64_decode('WJjZGVmZ2hpamtsbW5vcHFyc3R1dnd4eXpBQkNERUZHSElKS0xNTk9QUVJTVFVWV1hZWjAxMjM0NTY3');
abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ01234567
X���Y���Z���[���\���]���^�P���Q���R���S���T���U���V�L��
M� Similarly, we can show that dropping the first character of the base62 encoded data does not result in it corrupting the end of the string: $base62 = new Tuupola\Base62;
$e = $base62->encode($str);
$d = $base62->decode(substr($e, 0, -1));
$f = $base62->decode(substr($e, 1));
var_dump($str, $e, $d, $f);
string(60) "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ01234567"
string(81) "4p1Nflx8o0gcd8e3xWM0Idxg3Fuxk4R0eRc9fYLj2B83X5FRXw82YB3KaoyYLuNBn1LJW2KwVqMIAhO5P"
string(60) "���`�G�`����"�d�3i��3y�`��S�d6ϙ�}��6��\&}7 �l�}O�\000�!ۮG�"
string(60) "i+���p�TYZ01234567" (My IDE suppresses some of the invalid chars in the output). Removing a leading character in base64 acts differently because it will shift the entire encoded bit pattern and it will never resync. In base62 it eventually drifts back into sync after a certain number of characters because of the non-integral bit length; I've not worked out how long that is, but it looks like it's around 50 chars, though I guess it may also drift out of sync again later! This encoder I wrote many years ago is significantly faster than this one. It cheats by using some padding, chunking 8 input bytes into 11 output bytes, which also makes it much more robust, but at the same time breaks compatibility with "true" base62 encoding, though that's perfectly acceptable if you're in control of both ends. Anyway, the aim of this ticket was to raise the point that truncation (which is much more likely than losing leading chars) is much more damaging in base62, and perhaps to raise some suggestions about how to deal with that. |
but cheating means that it is not an actual base conversion, meaning that your encoder is not a base62 encoding anymore. It is a custom encoding scheme (that happens to use a 62-char alphabet for its output) |
Yes, that's what I said. It is however, still very useful. |
Base62 encoding was not invented by me so there is nothing I can do. |
This is something of an edge case, but it's reasonably important if encoded strings are embedded in URLs - incomplete data is better than trashed data. If I encode a string using base62, then remove 1 or more chars from the end of the string, then decode it again, it results in complete garbage output.
This is in contrast to similar encoders that degrade more gracefully, such as PHP's built-in base64 functions. In this example I've chosen an input length that does not result in padding in the base64 encoder, as deleting padding is always harmless:
Output:
Note that while the base62 output is unusable, the base64 output just loses a single character at the end.
I suspect this may be because the entire input string is treated as a single GMP number, rather than treating it as a byte stream, or encoding in chunks of a few chars at a time. It's noticeable when you encode nearly identical strings - in base62 they result in completely different output after the difference, rather than only differing in the areas where the input strings are different (as seen in base64). For example:
Output (^ indicates difference in output):
The text was updated successfully, but these errors were encountered: