ASCII, the American Standard Code for Information Interchange, is still one of the most widely used character encodings for electronic communication. The output of man ascii shown below lists the original 7-bit US-ASCII code, which covers only the Latin characters used in English plus a number of control characters dating from the early days of modem links, teleprinters and console terminals.
Oct | Dec | Hex | Char | Oct | Dec | Hex | Char |
---|---|---|---|---|---|---|---|
000 | 0 | 00 | NUL '\0' (null character) | 020 | 16 | 10 | DLE (data link escape) |
001 | 1 | 01 | SOH (start of heading) | 021 | 17 | 11 | DC1 (device control 1) |
002 | 2 | 02 | STX (start of text) | 022 | 18 | 12 | DC2 (device control 2) |
003 | 3 | 03 | ETX (end of text) | 023 | 19 | 13 | DC3 (device control 3) |
004 | 4 | 04 | EOT (end of transmission) | 024 | 20 | 14 | DC4 (device control 4) |
005 | 5 | 05 | ENQ (enquiry) | 025 | 21 | 15 | NAK (negative ack.) |
006 | 6 | 06 | ACK (acknowledge) | 026 | 22 | 16 | SYN (synchronous idle) |
007 | 7 | 07 | BEL '\a' (bell) | 027 | 23 | 17 | ETB (end of trans. blk) |
010 | 8 | 08 | BS '\b' (backspace) | 030 | 24 | 18 | CAN (cancel) |
011 | 9 | 09 | HT '\t' (horizontal tab) | 031 | 25 | 19 | EM (end of medium) |
012 | 10 | 0A | LF '\n' (new line) | 032 | 26 | 1A | SUB (substitute) |
013 | 11 | 0B | VT '\v' (vertical tab) | 033 | 27 | 1B | ESC (escape) |
014 | 12 | 0C | FF '\f' (form feed) | 034 | 28 | 1C | FS (file separator) |
015 | 13 | 0D | CR '\r' (carriage ret) | 035 | 29 | 1D | GS (group separator) |
016 | 14 | 0E | SO (shift out) | 036 | 30 | 1E | RS (record separator) |
017 | 15 | 0F | SI (shift in) | 037 | 31 | 1F | US (unit separator) |
SPACE and numbers 0 to 9:
Oct | Dec | Hex | Char | Oct | Dec | Hex | Char |
---|---|---|---|---|---|---|---|
040 | 32 | 20 | SPACE | 060 | 48 | 30 | 0 |
041 | 33 | 21 | ! | 061 | 49 | 31 | 1 |
042 | 34 | 22 | " | 062 | 50 | 32 | 2 |
043 | 35 | 23 | # | 063 | 51 | 33 | 3 |
044 | 36 | 24 | $ | 064 | 52 | 34 | 4 |
045 | 37 | 25 | % | 065 | 53 | 35 | 5 |
046 | 38 | 26 | & | 066 | 54 | 36 | 6 |
047 | 39 | 27 | ' | 067 | 55 | 37 | 7 |
050 | 40 | 28 | ( | 070 | 56 | 38 | 8 |
051 | 41 | 29 | ) | 071 | 57 | 39 | 9 |
052 | 42 | 2A | * | 072 | 58 | 3A | : |
053 | 43 | 2B | + | 073 | 59 | 3B | ; |
054 | 44 | 2C | , | 074 | 60 | 3C | < |
055 | 45 | 2D | - | 075 | 61 | 3D | = |
056 | 46 | 2E | . | 076 | 62 | 3E | > |
057 | 47 | 2F | / | 077 | 63 | 3F | ? |
Capital letters A to Z and small letters a to z:
Oct | Dec | Hex | Char | Oct | Dec | Hex | Char |
---|---|---|---|---|---|---|---|
100 | 64 | 40 | @ | 140 | 96 | 60 | ` |
101 | 65 | 41 | A | 141 | 97 | 61 | a |
102 | 66 | 42 | B | 142 | 98 | 62 | b |
103 | 67 | 43 | C | 143 | 99 | 63 | c |
104 | 68 | 44 | D | 144 | 100 | 64 | d |
105 | 69 | 45 | E | 145 | 101 | 65 | e |
106 | 70 | 46 | F | 146 | 102 | 66 | f |
107 | 71 | 47 | G | 147 | 103 | 67 | g |
110 | 72 | 48 | H | 150 | 104 | 68 | h |
111 | 73 | 49 | I | 151 | 105 | 69 | i |
112 | 74 | 4A | J | 152 | 106 | 6A | j |
113 | 75 | 4B | K | 153 | 107 | 6B | k |
114 | 76 | 4C | L | 154 | 108 | 6C | l |
115 | 77 | 4D | M | 155 | 109 | 6D | m |
116 | 78 | 4E | N | 156 | 110 | 6E | n |
117 | 79 | 4F | O | 157 | 111 | 6F | o |
120 | 80 | 50 | P | 160 | 112 | 70 | p |
121 | 81 | 51 | Q | 161 | 113 | 71 | q |
122 | 82 | 52 | R | 162 | 114 | 72 | r |
123 | 83 | 53 | S | 163 | 115 | 73 | s |
124 | 84 | 54 | T | 164 | 116 | 74 | t |
125 | 85 | 55 | U | 165 | 117 | 75 | u |
126 | 86 | 56 | V | 166 | 118 | 76 | v |
127 | 87 | 57 | W | 167 | 119 | 77 | w |
130 | 88 | 58 | X | 170 | 120 | 78 | x |
131 | 89 | 59 | Y | 171 | 121 | 79 | y |
132 | 90 | 5A | Z | 172 | 122 | 7A | z |
133 | 91 | 5B | [ | 173 | 123 | 7B | { |
134 | 92 | 5C | \ '\\' | 174 | 124 | 7C | | |
135 | 93 | 5D | ] | 175 | 125 | 7D | } |
136 | 94 | 5E | ^ | 176 | 126 | 7E | ~ |
137 | 95 | 5F | _ | 177 | 127 | 7F | DEL |
The extended ASCII codes use 8 bits, which increases the code table to 2^8 = 256 characters in order to cover characters in other languages. An example is the ISO 8859-1 (Latin-1) extension, which covers the accented Latin characters used in other Western European languages: à ä é è ê ô ö ù ü …
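As a small illustration of the difference between 7-bit ASCII and an 8-bit extension, the following sketch encodes a few accented characters with Python's built-in latin-1 codec. The sample string and the variable names are chosen here for illustration only.

```python
# Accented characters lie outside 7-bit ASCII, but Latin-1 maps each of them
# to exactly one byte in the range 0x80..0xFF.
text = "àäéèêôöùü"

latin1_bytes = text.encode("latin-1")
print(len(text), len(latin1_bytes))              # one byte per character
print([format(b, "02x") for b in latin1_bytes])

# The same string cannot be represented in plain ASCII.
try:
    text.encode("ascii")
    ascii_ok = True
except UnicodeEncodeError:
    ascii_ok = False
print(ascii_ok)
```

In Latin-1 the byte value of a character equals its ordinal value, so ä is stored as the single byte 0xe4 (decimal 228).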
Python 1: We learn how to manipulate ASCII characters and strings
The ordinal value of a single ASCII character can be determined with the ord() function:
>>> print(ord('0'), ord('9'))
48 57
>>> print(ord('A'), ord('Z'))
65 90
>>> print(ord('a'), ord('z'))
97 122
When the ordinal value is known, the corresponding ASCII character can be generated with the chr() function:
>>> print(chr(48), chr(57))
0 9
>>> print(chr(65), chr(90))
A Z
We generate a byte array from an ASCII string containing the digits 0 to 9:
>>> b_num = bytearray("0123456789", "ascii")
>>> print(b_num, len(b_num))
bytearray(b'0123456789') 10
>>> [c for c in b_num]
[48, 49, 50, 51, 52, 53, 54, 55, 56, 57] # dec
>>> b_num[9]
57
>>> [format(c, '02x') for c in b_num]
['30', '31', '32', '33', '34', '35', '36', '37', '38', '39'] # hex
>>> b_num.decode("ascii") # convert back from bytearray to ASCII string
'0123456789'
We generate a byte array from an ASCII string containing the uppercase characters A to Z:
>>> b_chr = bytearray("ABCDEFGHIJKLMNOPQRSTUVWXYZ", "ascii")
>>> print(b_chr, len(b_chr))
bytearray(b'ABCDEFGHIJKLMNOPQRSTUVWXYZ') 26
>>> [c for c in b_chr]
[65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90]
>>> b_chr[0]
65
>>> [format(c, '02x') for c in b_chr]
['41', '42', '43', '44', '45', '46', '47', '48', '49', '4a', '4b', '4c', '4d', '4e', '4f', '50', '51', '52', '53', '54', '55', '56', '57', '58', '59', '5a']
>>> b_chr.decode("ascii") # convert back from bytearray to ASCII string
'ABCDEFGHIJKLMNOPQRSTUVWXYZ'
We load an ASCII string into a byte array, convert the byte array into a hex string and convert the hex string back into the original byte string.
>>> import binascii
>>> num_str = '0123456789'
>>> num_arr = bytearray(num_str, 'ascii')
>>> print(num_arr)
bytearray(b'0123456789')
>>> num_hex = binascii.hexlify(num_arr)
>>> print(num_hex)
b'30313233343536373839'
>>> binascii.unhexlify(num_hex)
b'0123456789'
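The same round trip can also be sketched without the binascii module, using the hex() and fromhex() methods that Python 3 provides on bytes and bytearray objects. Note that hex() returns a str rather than a bytes object; the variable names below are only illustrative.

```python
num_arr = bytearray("0123456789", "ascii")

# bytearray.hex() yields a text string of hex digits
num_hex = num_arr.hex()
print(num_hex)

# bytes.fromhex() parses the hex digits back into raw bytes,
# which are then decoded into an ASCII string
restored = bytes.fromhex(num_hex)
print(restored.decode("ascii"))
```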
The following Python script ascii.py capitalizes lowercase characters in a byte array by clearing bit 5 in the ASCII encoding and then converts all uppercase characters to lowercase by setting bit 5.
#!/usr/bin/python3
b_arr = bytearray("Oh my god! There are 20 students in the classroom, (\x07)", "ascii")
print(b_arr)
# Capitalize lowercase characters a..z (0x61..0x7a) by clearing bit 5
for i in range(len(b_arr)):
    if 0x61 <= b_arr[i] <= 0x7a:
        b_arr[i] &= 0b11011111
print(b_arr)
# Convert uppercase characters A..Z (0x41..0x5a) to lowercase by setting bit 5
for i in range(len(b_arr)):
    if 0x41 <= b_arr[i] <= 0x5a:
        b_arr[i] |= 0x20
print(b_arr)
The output of the script is:
$ python3 ascii.py
bytearray(b'Oh my god! There are 20 students in the classroom, (\x07)')
bytearray(b'OH MY GOD! THERE ARE 20 STUDENTS IN THE CLASSROOM, (\x07)')
bytearray(b'oh my god! there are 20 students in the classroom, (\x07)')
As a comparison we show how conversion to uppercase or lowercase can be easily done using Python string functions.
>>> b_str = 'Oh my god! There are 20 students in the classroom, (\x07)'
>>> b_str.upper()
'OH MY GOD! THERE ARE 20 STUDENTS IN THE CLASSROOM, (\x07)'
>>> b_str.lower()
'oh my god! there are 20 students in the classroom, (\x07)'
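Since uppercase and lowercase ASCII letters differ only in bit 5, the case of a letter can also be flipped by XOR-ing its code with 0x20. The helper swap_case_ascii below is a hypothetical name introduced for this sketch and is not part of the script above.

```python
def swap_case_ascii(data: bytes) -> bytearray:
    """Flip the case of ASCII letters by toggling bit 5 (value 0x20)."""
    out = bytearray(data)
    for i, b in enumerate(out):
        is_upper = 0x41 <= b <= 0x5A   # 'A'..'Z'
        is_lower = 0x61 <= b <= 0x7A   # 'a'..'z'
        if is_upper or is_lower:
            out[i] = b ^ 0x20          # XOR toggles bit 5
    return out

swapped = swap_case_ascii(b"Oh my god! There are 20 students")
print(swapped)
```

Digits, spaces and punctuation are left untouched because they fall outside the two letter ranges. Applying the function twice returns the original bytes.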
The 8-bit Unicode Transformation Format UTF-8 is a variable-width character encoding capable of encoding all 1,112,064 valid code points in Unicode (the Universal Coded Character Set) using one to four 8-bit bytes.
It was designed for backward compatibility with ASCII. Code points with lower numerical values, which tend to occur more frequently, are encoded using fewer bytes. The first 128 characters of Unicode, which correspond one-to-one with ASCII, are encoded using a single byte with the same binary value as ASCII, so that valid ASCII text is valid UTF-8-encoded Unicode as well. Since ASCII bytes do not occur when encoding non-ASCII code points into UTF-8, UTF-8 is safe to use within most programming and document languages that interpret certain ASCII characters in a special way, such as / (slash) in filenames, \ (backslash) in escape sequences, and % in printf.
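The claim that ASCII bytes never occur inside a multi-byte sequence can be checked with a short sketch. The sample strings are arbitrary choices made for this illustration.

```python
# Every byte of a multi-byte UTF-8 sequence has its top bit set (>= 0x80),
# so it can never be mistaken for an ASCII byte such as '/' (0x2f).
sample = "Jürgen/blöd/Paß€"
multibyte = [b for ch in sample if ord(ch) > 0x7F for b in ch.encode("utf-8")]
all_high = all(b >= 0x80 for b in multibyte)
print(all_high)

# Splitting the encoded bytes on the ASCII slash therefore never cuts a
# character in half, and every part decodes cleanly.
encoded_path = "Jürgen/blöd/Paß".encode("utf-8")
parts = [part.decode("utf-8") for part in encoded_path.split(b"/")]
print(parts)
```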
Since the restriction of the Unicode code-space to 21-bit values in 2003, UTF-8 is defined to encode code points in one to four bytes, depending on the number of significant bits in the numerical value of the code point. The following table shows the structure of the encoding. The x characters are replaced by the bits of the code point. If the number of significant bits is no more than seven, the first line applies; if no more than 11 bits, the second line applies, and so on.
Number of Bytes | Bits for code point | First code point | Last code point | Byte 1 | Byte 2 | Byte 3 | Byte 4 |
---|---|---|---|---|---|---|---|
1 | 7 | U+0000 | U+007F | 0xxxxxxx | |||
2 | 11 | U+0080 | U+07FF | 110xxxxx | 10xxxxxx | ||
3 | 16 | U+0800 | U+FFFF | 1110xxxx | 10xxxxxx | 10xxxxxx | |
4 | 21 | U+10000 | U+10FFFF | 11110xxx | 10xxxxxx | 10xxxxxx | 10xxxxxx |
- The first 128 characters (US-ASCII) need one byte.
- The next 1,920 characters need two bytes to encode, which covers the remainder of almost all Latin-script alphabets, and also Greek, Cyrillic, Coptic, Armenian, Hebrew, Arabic, Syriac, Thaana and N'Ko alphabets, as well as Combining Diacritical Marks.
- Three bytes are needed for characters in the rest of the Basic Multilingual Plane, which contains virtually all characters in common use, including most Chinese, Japanese and Korean characters.
- Four bytes are needed for characters in the other planes of Unicode, which include less common CJK characters, various historic scripts, mathematical symbols, and emoji (pictographic symbols).
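The encoding table above can be turned into a few lines of bit manipulation. The function utf8_encode below is a hypothetical helper written for this sketch; in practice Python's str.encode("utf-8") does the same job and is used here only to cross-check the result.

```python
def utf8_encode(cp: int) -> bytes:
    """Encode a single Unicode code point following the UTF-8 table."""
    if cp < 0 or cp > 0x10FFFF or 0xD800 <= cp <= 0xDFFF:
        raise ValueError("not a valid Unicode code point for UTF-8")
    if cp <= 0x7F:                      # 1 byte:  0xxxxxxx
        return bytes([cp])
    if cp <= 0x7FF:                     # 2 bytes: 110xxxxx 10xxxxxx
        return bytes([0xC0 | (cp >> 6),
                      0x80 | (cp & 0x3F)])
    if cp <= 0xFFFF:                    # 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0xE0 | (cp >> 12),
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])
    return bytes([0xF0 | (cp >> 18),    # 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
                  0x80 | ((cp >> 12) & 0x3F),
                  0x80 | ((cp >> 6) & 0x3F),
                  0x80 | (cp & 0x3F)])

# One code point from each row of the table
for cp in (0x61, 0xE4, 0x20AC, 0x1F600):
    encoded = utf8_encode(cp)
    assert encoded == chr(cp).encode("utf-8")
    print(f"U+{cp:04X}", encoded.hex(), len(encoded))
```

The surrogate range U+D800 to U+DFFF is rejected because those 2,048 values are not valid code points for UTF-8, which is why the total is 1,112,064 rather than 1,114,112.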
Python 2: We analyze some UTF-8 encoding examples.
The ord() and chr() functions also work with non-ASCII characters, where they return and accept the Unicode code point:
>>> ord('€')
8364
>>> chr(8364)
'€'
Next, a very simple example containing the ASCII character a, the German character ä, the French character à and the Latin small letter a with ring below ḁ:
>>> import binascii
>>> chr_str = u'aäàḁ'
>>> print(chr_str, len(chr_str))
aäàḁ 4
>>> [c for c in chr_str]
['a', 'ä', 'à', 'ḁ']
>>> [ord(c) for c in chr_str]
[97, 228, 224, 7681]
>>> [binascii.hexlify(bytearray(c, 'utf-8')) for c in chr_str]
[b'61', b'c3a4', b'c3a0', b'e1b881']
>>> chr_arr = bytearray(chr_str, 'utf-8')
>>> print(chr_arr, len(chr_arr))
bytearray(b'a\xc3\xa4\xc3\xa0\xe1\xb8\x81') 8
>>> [format(x, '08b') for x in chr_arr]
['01100001', '11000011', '10100100', '11000011', '10100000', '11100001', '10111000', '10000001']
>>> chr_hex = binascii.hexlify(chr_arr)
>>> print(chr_hex, len(chr_hex))
b'61c3a4c3a0e1b881' 16
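To see how the three bytes e1 b8 81 map back to the code point 7681, the payload bits can be extracted by hand. This is a minimal sketch for a single three-byte sequence only, not a general decoder.

```python
seq = bytes.fromhex("e1b881")          # UTF-8 encoding of 'ḁ'
b1, b2, b3 = seq

# Check the marker bits: 1110xxxx 10xxxxxx 10xxxxxx
assert b1 >> 4 == 0b1110
assert b2 >> 6 == 0b10 and b3 >> 6 == 0b10

# Strip the markers and concatenate the 4 + 6 + 6 payload bits
code_point = ((b1 & 0x0F) << 12) | ((b2 & 0x3F) << 6) | (b3 & 0x3F)
print(code_point, hex(code_point), chr(code_point))
```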
Here is the UTF-8 encoding of a German pangram containing 35 different characters:
>>> import binascii
>>> chr_str = u'„Fix, Schwyz!“, quäkt Jürgen blöd vom Paß'
>>> print(chr_str, len(chr_str))
„Fix, Schwyz!“, quäkt Jürgen blöd vom Paß 41
>>> [c for c in chr_str]
['„', 'F', 'i', 'x', ',', ' ', 'S', 'c', 'h', 'w', 'y', 'z', '!', '“', ',', ' ', 'q', 'u', 'ä', 'k', 't', ' ', 'J', 'ü', 'r', 'g', 'e', 'n', ' ', 'b', 'l', 'ö', 'd', ' ', 'v', 'o', 'm', ' ', 'P', 'a', 'ß']
>>> [binascii.hexlify(bytearray(c, 'utf-8')) for c in chr_str]
[b'e2809e', b'46', b'69', b'78', b'2c', b'20', b'53', b'63', b'68', b'77', b'79', b'7a', b'21', b'e2809c', b'2c', b'20', b'71', b'75', b'c3a4', b'6b', b'74', b'20', b'4a', b'c3bc', b'72', b'67', b'65', b'6e', b'20', b'62', b'6c', b'c3b6', b'64', b'20', b'76', b'6f', b'6d', b'20', b'50', b'61', b'c39f']
>>> chr_arr = bytearray(chr_str, 'utf-8')
>>> print(chr_arr, len(chr_arr))
bytearray(b'\xe2\x80\x9eFix, Schwyz!\xe2\x80\x9c, qu\xc3\xa4kt J\xc3\xbcrgen bl\xc3\xb6d vom Pa\xc3\x9f') 49
>>> chr_hex = binascii.hexlify(chr_arr)
>>> print(chr_hex, len(chr_hex))
b'e2809e4669782c2053636877797a21e2809c2c207175c3a46b74204ac3bc7267656e20626cc3b66420766f6d205061c39f' 98
And here is the UTF-8 encoding of a French pangram containing all 45 characters of the French language:
>>> import binascii
>>> chr_str = u'Dès Noël, où un zéphyr haï me vêt de glaçons würmiens, je dîne d’exquis rôtis de bœuf au kir, à l’aÿ d’âge mûr, &cætera.'
>>> print(chr_str, len(chr_str))
Dès Noël, où un zéphyr haï me vêt de glaçons würmiens, je dîne d’exquis rôtis de bœuf au kir, à l’aÿ d’âge mûr, &cætera. 120
>>> [c for c in chr_str]
['D', 'è', 's', ' ', 'N', 'o', 'ë', 'l', ',', ' ', 'o', 'ù', ' ', 'u', 'n', ' ', 'z', 'é', 'p', 'h', 'y', 'r', ' ', 'h', 'a', 'ï', ' ', 'm', 'e', ' ', 'v', 'ê', 't', ' ', 'd', 'e', ' ', 'g', 'l', 'a', 'ç', 'o', 'n', 's', ' ', 'w', 'ü', 'r', 'm', 'i', 'e', 'n', 's', ',', ' ', 'j', 'e', ' ', 'd', 'î', 'n', 'e', ' ', 'd', '’', 'e', 'x', 'q', 'u', 'i', 's', ' ', 'r', 'ô', 't', 'i', 's', ' ', 'd', 'e', ' ', 'b', 'œ', 'u', 'f', ' ', 'a', 'u', ' ', 'k', 'i', 'r', ',', ' ', 'à', ' ', 'l', '’', 'a', 'ÿ', ' ', 'd', '’', 'â', 'g', 'e', ' ', 'm', 'û', 'r', ',', ' ', '&', 'c', 'æ', 't', 'e', 'r', 'a', '.']
>>> [binascii.hexlify(bytearray(c, 'utf-8')) for c in chr_str]
[b'44', b'c3a8', b'73', b'20', b'4e', b'6f', b'c3ab', b'6c', b'2c', b'20', b'6f', b'c3b9', b'20', b'75', b'6e', b'20', b'7a', b'c3a9', b'70', b'68', b'79', b'72', b'20', b'68', b'61', b'c3af', b'20', b'6d', b'65', b'20', b'76', b'c3aa', b'74', b'20', b'64', b'65', b'20', b'67', b'6c', b'61', b'c3a7', b'6f', b'6e', b'73', b'20', b'77', b'c3bc', b'72', b'6d', b'69', b'65', b'6e', b'73', b'2c', b'20', b'6a', b'65', b'20', b'64', b'c3ae', b'6e', b'65', b'20', b'64', b'e28099', b'65', b'78', b'71', b'75', b'69', b'73', b'20', b'72', b'c3b4', b'74', b'69', b'73', b'20', b'64', b'65', b'20', b'62', b'c593', b'75', b'66', b'20', b'61', b'75', b'20', b'6b', b'69', b'72', b'2c', b'20', b'c3a0', b'20', b'6c', b'e28099', b'61', b'c3bf', b'20', b'64', b'e28099', b'c3a2', b'67', b'65', b'20', b'6d', b'c3bb', b'72', b'2c', b'20', b'26', b'63', b'c3a6', b'74', b'65', b'72', b'61', b'2e']
>>> chr_arr = bytearray(chr_str, 'utf-8')
>>> print(chr_arr, len(chr_arr))
bytearray(b'D\xc3\xa8s No\xc3\xabl, o\xc3\xb9 un z\xc3\xa9phyr ha\xc3\xaf me v\xc3\xaat de gla\xc3\xa7ons w\xc3\xbcrmiens, je d\xc3\xaene d\xe2\x80\x99exquis r\xc3\xb4tis de b\xc5\x93uf au kir, \xc3\xa0 l\xe2\x80\x99a\xc3\xbf d\xe2\x80\x99\xc3\xa2ge m\xc3\xbbr, &c\xc3\xa6tera.') 142
>>> chr_hex = binascii.hexlify(chr_arr)
>>> print(chr_hex, len(chr_hex))
b'44c3a873204e6fc3ab6c2c206fc3b920756e207ac3a970687972206861c3af206d652076c3aa7420646520676c61c3a76f6e732077c3bc726d69656e732c206a652064c3ae6e652064e280996578717569732072c3b47469732064652062c5937566206175206b69722c20c3a0206ce2809961c3bf2064e28099c3a26765206dc3bb722c202663c3a6746572612e' 284
Finally, the UTF-8 encoding of a Russian pangram containing all 33 letters of the Russian alphabet:
>>> import binascii
>>> chr_str = u'Эй, жлоб! Где туз? Прячь юных съёмщиц в шкаф'
>>> print(chr_str, len(chr_str))
Эй, жлоб! Где туз? Прячь юных съёмщиц в шкаф 44
>>> [binascii.hexlify(bytearray(c, 'utf-8')) for c in chr_str]
[b'd0ad', b'd0b9', b'2c', b'20', b'd0b6', b'd0bb', b'd0be', b'd0b1', b'21', b'20', b'd093', b'd0b4', b'd0b5', b'20', b'd182', b'd183', b'd0b7', b'3f', b'20', b'd09f', b'd180', b'd18f', b'd187', b'd18c', b'20', b'd18e', b'd0bd', b'd18b', b'd185', b'20', b'd181', b'd18a', b'd191', b'd0bc', b'd189', b'd0b8', b'd186', b'20', b'd0b2', b'20', b'd188', b'd0ba', b'd0b0', b'd184']
>>> chr_arr = bytearray(chr_str, 'utf-8')
>>> print(chr_arr, len(chr_arr))
bytearray(b'\xd0\xad\xd0\xb9, \xd0\xb6\xd0\xbb\xd0\xbe\xd0\xb1! \xd0\x93\xd0\xb4\xd0\xb5 \xd1\x82\xd1\x83\xd0\xb7? \xd0\x9f\xd1\x80\xd1\x8f\xd1\x87\xd1\x8c \xd1\x8e\xd0\xbd\xd1\x8b\xd1\x85 \xd1\x81\xd1\x8a\xd1\x91\xd0\xbc\xd1\x89\xd0\xb8\xd1\x86 \xd0\xb2 \xd1\x88\xd0\xba\xd0\xb0\xd1\x84') 77
>>> chr_hex = binascii.hexlify(chr_arr)
>>> print(chr_hex, len(chr_hex))
b'd0add0b92c20d0b6d0bbd0bed0b12120d093d0b4d0b520d182d183d0b73f20d09fd180d18fd187d18c20d18ed0bdd18bd18520d181d18ad191d0bcd189d0b8d18620d0b220d188d0bad0b0d184' 154
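The difference between the character count and the byte count of a string can be summarized by tallying how many characters need one, two or three bytes. The helper utf8_width_histogram is a hypothetical name used only in this sketch.

```python
from collections import Counter

def utf8_width_histogram(text: str) -> Counter:
    """Count how many characters of text need 1, 2, 3 or 4 UTF-8 bytes."""
    return Counter(len(ch.encode("utf-8")) for ch in text)

russian = "Эй, жлоб! Где туз? Прячь юных съёмщиц в шкаф"
hist = utf8_width_histogram(russian)

# The weighted sum of the histogram equals the length of the encoded string
total = sum(width * count for width, count in hist.items())
print(dict(hist), total)
```

For the Russian pangram every Cyrillic letter takes two bytes while spaces and punctuation take one, which explains why 44 characters grow to 77 bytes.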
The Base64 scheme encodes N binary bytes into 4 * ⌈N/3⌉ printable ASCII characters and thus makes it possible to carry data stored in binary formats across channels that only reliably support 7-bit US-ASCII text content. The encoding is shown below:
+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+
|     Binary Byte A     |     Binary Byte B     |     Binary Byte C     |
+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+
 A0 A1 A2 A3 A4 A5 A6 A7 B0 B1 B2 B3 B4 B5 B6 B7 C0 C1 C2 C3 C4 C5 C6 C7  Bits
+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+
|  Base64 Char 1  |  Base64 Char 2  |  Base64 Char 3  |  Base64 Char 4  |
+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+
The 24 bits A0..A7, B0..B7, C0..C7 of 3 consecutive bytes A, B and C are split into 4 groups of 6 bits each that are then mapped to the 64 printable US-ASCII characters A..Z, a..z, 0..9, + and / according to the table listed below.
I | Binary | Char | I | Binary | Char | I | Binary | Char | I | Binary | Char |
---|---|---|---|---|---|---|---|---|---|---|---|
0 | 000000 | A | 16 | 010000 | Q | 32 | 100000 | g | 48 | 110000 | w |
1 | 000001 | B | 17 | 010001 | R | 33 | 100001 | h | 49 | 110001 | x |
2 | 000010 | C | 18 | 010010 | S | 34 | 100010 | i | 50 | 110010 | y |
3 | 000011 | D | 19 | 010011 | T | 35 | 100011 | j | 51 | 110011 | z |
4 | 000100 | E | 20 | 010100 | U | 36 | 100100 | k | 52 | 110100 | 0 |
5 | 000101 | F | 21 | 010101 | V | 37 | 100101 | l | 53 | 110101 | 1 |
6 | 000110 | G | 22 | 010110 | W | 38 | 100110 | m | 54 | 110110 | 2 |
7 | 000111 | H | 23 | 010111 | X | 39 | 100111 | n | 55 | 110111 | 3 |
8 | 001000 | I | 24 | 011000 | Y | 40 | 101000 | o | 56 | 111000 | 4 |
9 | 001001 | J | 25 | 011001 | Z | 41 | 101001 | p | 57 | 111001 | 5 |
10 | 001010 | K | 26 | 011010 | a | 42 | 101010 | q | 58 | 111010 | 6 |
11 | 001011 | L | 27 | 011011 | b | 43 | 101011 | r | 59 | 111011 | 7 |
12 | 001100 | M | 28 | 011100 | c | 44 | 101100 | s | 60 | 111100 | 8 |
13 | 001101 | N | 29 | 011101 | d | 45 | 101101 | t | 61 | 111101 | 9 |
14 | 001110 | O | 30 | 011110 | e | 46 | 101110 | u | 62 | 111110 | + |
15 | 001111 | P | 31 | 011111 | f | 47 | 101111 | v | 63 | 111111 | / |
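The mapping of one complete 3-byte group onto 4 Base64 characters can be sketched as follows. The names B64_ALPHABET and encode_group are introduced here for illustration; the standard library function base64.b64encode is the one to use in real code.

```python
import string

# The 64-character table listed above: A..Z, a..z, 0..9, + and /
B64_ALPHABET = string.ascii_uppercase + string.ascii_lowercase + string.digits + "+/"

def encode_group(a: int, b: int, c: int) -> str:
    """Map three bytes (24 bits) onto four Base64 characters (4 x 6 bits)."""
    word = (a << 16) | (b << 8) | c
    indices = [(word >> shift) & 0x3F for shift in (18, 12, 6, 0)]
    return "".join(B64_ALPHABET[i] for i in indices)

print(encode_group(0x11, 0x22, 0x33))
```

For the bytes 0x11 0x22 0x33 the four 6-bit groups are 000100, 010010, 001000 and 110011, i.e. the indices 4, 18, 8 and 51, which the table maps to E, S, I and z.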
If the total number N of bytes contained in a binary blob is not an exact multiple of three, i.e. N mod 3 != 0, then one or two bytes remain at the end of the data block and have to be processed separately.
For the case N mod 3 = 2, the remaining 16 bits A0..A7, B0..B7 are padded with two zero bits and the resulting 18 bits can then be mapped to 3 Base64 characters. In order to indicate the padding, an additional = ASCII character is appended, making the total number of Base64 characters a multiple of four.
+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+
|     Binary Byte A     |     Binary Byte B     |
+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+
 A0 A1 A2 A3 A4 A5 A6 A7 B0 B1 B2 B3 B4 B5 B6 B7  0  0  Bits
+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+
|  Base64 Char 1  |  Base64 Char 2  |  Base64 Char 3  |        =        |
+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+
For the case N mod 3 = 1, the remaining 8 bits A0..A7 are padded with four zero bits and the resulting 12 bits can then be mapped to 2 Base64 characters. In order to indicate the padding, two additional = ASCII characters are appended, making the total number of Base64 characters a multiple of four.
+--+--+--+--+--+--+--+--+
|     Binary Byte A     |
+--+--+--+--+--+--+--+--+
 A0 A1 A2 A3 A4 A5 A6 A7  0  0  0  0  Bits
+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+
|  Base64 Char 1  |  Base64 Char 2  |        =        |        =        |
+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+
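The two padding cases can be folded into a small hand-written encoder. This is a sketch for study purposes under the rules described above; b64encode_sketch and B64_ALPHABET are hypothetical names, and the result is cross-checked against the standard library.

```python
import base64
import string

B64_ALPHABET = string.ascii_uppercase + string.ascii_lowercase + string.digits + "+/"

def b64encode_sketch(data: bytes) -> str:
    """Encode data as Base64, padding the last group with '=' if needed."""
    out = []
    for i in range(0, len(data), 3):
        chunk = data[i:i + 3]
        pad = 3 - len(chunk)                      # 0, 1 or 2 missing bytes
        # Fill the missing bytes with zero bits to get a full 24-bit word
        word = int.from_bytes(chunk + b"\x00" * pad, "big")
        chars = [B64_ALPHABET[(word >> shift) & 0x3F] for shift in (18, 12, 6, 0)]
        # Drop the characters made purely of padding bits, append '=' instead
        out.append("".join(chars[:4 - pad]) + "=" * pad)
    return "".join(out)

for data in (bytes([0x11, 0x22, 0x33, 0x44, 0x55]), bytes([0x11, 0x22, 0x33, 0x44])):
    encoded = b64encode_sketch(data)
    assert encoded == base64.b64encode(data).decode("ascii")
    print(data.hex(), encoded)
```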
Python 3: Use the built-in Base64 encoding and decoding functions and compare them with hex encoding.
Base64 encoding of 6 binary bytes:
>>> import base64, binascii
>>> data_bin = bytes([0x11, 0x22, 0x33, 0x44, 0x55, 0x66])
>>> [format(x, '08b') for x in data_bin]
['00010001', '00100010', '00110011', '01000100', '01010101', '01100110']
>>> data_hex = binascii.hexlify(data_bin)
>>> print(data_hex, len(data_hex))
b'112233445566' 12
>>> data_b64 = base64.b64encode(data_bin)
>>> print(data_b64, len(data_b64))
b'ESIzRFVm' 8
>>> [format(x, '08b') for x in base64.b64decode(data_b64)]
['00010001', '00100010', '00110011', '01000100', '01010101', '01100110']
Base64 encoding of 5 binary bytes requires one padding character:
>>> import base64, binascii
>>> data_bin = bytes([0x11, 0x22, 0x33, 0x44, 0x55])
>>> [format(x, '08b') for x in data_bin]
['00010001', '00100010', '00110011', '01000100', '01010101']
>>> data_hex = binascii.hexlify(data_bin)
>>> print(data_hex, len(data_hex))
b'1122334455' 10
>>> data_b64 = base64.b64encode(data_bin)
>>> print(data_b64, len(data_b64))
b'ESIzRFU=' 8
>>> [format(x, '08b') for x in base64.b64decode(data_b64)]
['00010001', '00100010', '00110011', '01000100', '01010101']
Base64 encoding of 4 binary bytes requires two padding characters:
>>> import base64, binascii
>>> data_bin = bytes([0x11, 0x22, 0x33, 0x44])
>>> [format(x, '08b') for x in data_bin]
['00010001', '00100010', '00110011', '01000100']
>>> data_hex = binascii.hexlify(data_bin)
>>> print(data_hex, len(data_hex))
b'11223344' 8
>>> data_b64 = base64.b64encode(data_bin)
>>> print(data_b64, len(data_b64))
b'ESIzRA==' 8
>>> [format(x, '08b') for x in base64.b64decode(data_b64)]
['00010001', '00100010', '00110011', '01000100']
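The size overhead of the two encodings can be compared directly: hex encoding always needs 2 * N characters, whereas Base64 needs 4 * ⌈N/3⌉. The helper encoded_lengths is a hypothetical name used for this sketch, and the byte counts are arbitrary test values.

```python
import base64
import binascii
import math

def encoded_lengths(n: int) -> tuple:
    """Return (hex length, Base64 length) for n zero bytes."""
    data = bytes(n)
    return len(binascii.hexlify(data)), len(base64.b64encode(data))

for n in (4, 5, 6, 100):
    hex_len, b64_len = encoded_lengths(n)
    assert hex_len == 2 * n
    assert b64_len == 4 * math.ceil(n / 3)
    print(n, hex_len, b64_len)
```

For short inputs the padding can make both encodings equally long, as with 4 bytes above, but for larger inputs Base64 approaches an overhead of one third compared to a doubling for hex.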
Author: Andreas Steffen CC BY 4.0