Character Sets

ASCII

Still one of the most widely used character encodings for electronic communication is ASCII , the American Standard Code for Information Interchange. The output of man ascii shown below lists the original 7-bit US-ASCII code covering only latin characters used in English plus a lot of control characters used in the early days of modem links, teleprinters and console terminals.

Oct	Dec	Hex	Char	Oct	Dec	Hex	Char
000	0	00	NUL '\0' (null character)	020	16	10	DLE (data link escape)
001	1	01	SOH (start of heading)	021	17	11	DC1 (device control 1)
002	2	02	STX (start of text)	022	18	12	DC2 (device control 2)
003	3	03	ETX (end of text)	023	19	13	DC3 (device control 3)
004	4	04	EOT (end of transmission)	024	20	14	DC4 (device control 4)
005	5	05	ENQ (enquiry)	025	21	15	NAK (negative ack.)
006	6	06	ACK (acknowledge)	026	22	16	SYN (synchronous idle)
007	7	07	BEL '\a' (bell)	027	23	17	ETB (end of trans. blk)
010	8	08	BS '\b' (backspace)	030	24	18	CAN (cancel)
011	9	09	HT '\t' (horizontal tab)	031	25	19	EM (end of medium)
012	10	0A	LF '\n' (new line)	032	26	1A	SUB (substitute)
013	11	0B	VT '\v' (vertical tab)	033	27	1B	ESC (escape)
014	12	0C	FF '\f' (form feed)	034	28	1C	FS (file separator)
015	13	0D	CR '\r' (carriage ret)	035	29	1D	GS (group separator)
016	14	0E	SO (shift out)	036	30	1E	RS (record separator)
017	15	0F	SI (shift in)	037	31	1F	US (unit separator)

SPACE and numbers 0 to 9:

Oct	Dec	Hex	Char	Oct	Dec	Hex	Char
040	32	20	SPACE	060	48	30	0
041	33	21	!	061	49	31	1
042	34	22	"	062	50	32	2
043	35	23	#	063	51	33	3
044	36	24	$	064	52	34	4
045	37	25	%	065	53	35	5
046	38	26	&	066	54	36	6
047	39	27	'	067	55	37	7
050	40	28	(	070	56	38	8
051	41	29	)	071	57	39	9
052	42	2A	*	072	58	3A	:
053	43	2B	+	073	59	3B	;
054	44	2C	,	074	60	3C	<
055	45	2D	-	075	61	3D	=
056	46	2E	.	076	62	3E	>
057	47	2F	/	077	63	3F	?

Capital letters A to Z and small letters a to z:

Oct	Dec	Hex	Char	Oct	Dec	Hex	Char
100	64	40	@	140	96	60	`
101	65	41	A	141	97	61	a
102	66	42	B	142	98	62	b
103	67	43	C	143	99	63	c
104	68	44	D	144	100	64	d
105	69	45	E	145	101	65	e
106	70	46	F	146	102	66	f
107	71	47	G	147	103	67	g
110	72	48	H	150	104	68	h
111	73	49	I	151	105	69	i
112	74	4A	J	152	106	6A	j
113	75	4B	K	153	107	6B	k
114	76	4C	L	154	108	6C	l
115	77	4D	M	155	109	6D	m
116	78	4E	N	156	110	6E	n
117	79	4F	O	157	111	6F	o
120	80	50	P	160	112	70	p
121	81	51	Q	161	113	71	q
122	82	52	R	162	114	72	r
123	83	53	S	163	115	73	s
124	84	54	T	164	116	74	t
125	85	55	U	165	117	75	u
126	86	56	V	166	118	76	v
127	87	57	W	167	119	77	w
130	88	58	X	170	120	78	x
131	89	59	Y	171	121	79	y
132	90	5A	Z	172	122	7A	z
133	91	5B	[	173	123	7B	{
134	92	5C	\ '\'	174	124	7C	\|
135	93	5D	]	175	125	7D	}
136	94	5E	^	176	126	7E	~
137	95	5F	_	177	127	7F	DEL

The extended ASCII code uses 8 bits thus increasing the code table to 2^8 = 256 characters in order to cover characters in other languages, e.g. the ISO 8859 Latin-1 extension that covers latin characters in other western languages: à ä e e ê ô ö ù ü …

Python 1: We learn how to manipulate ASCII characters and strings

The ordinal value of a single ASCII character can be determined with the ord() function:

>>> print(ord('0'), ord('9'))
48 57
>>> print(ord('A'), ord('Z'))
65 90
>>> print(ord('a'), ord('z'))
97 122

When the ordinal value is known, the corresponding ASCII character can be generated with the chr() function

>>> print(chr(48), chr(57))
0 9
>>> print(chr(65), chr(90))
A Z

We generate a byte array from an ASCII string containing the numbers 0 to 9

b_num = bytearray("0123456789", "ascii")
>>> print(b_num, len(b_num))
bytearray(b'0123456789') 10
>>> [c for c in b_num]
[48, 49, 50, 51, 52, 53, 54, 55, 56, 57]                      # dec
>>> b_num[9]
57
>>> [format(c, '02x') for c in b_num]
['30', '31', '32', '33', '34', '35', '36', '37', '38', '39']  # hex
>>> b_num.decode("ascii")    # convert back from bytearray to ASCII string
'0123456789'

We generate a byte array from an ASCII string containing the uppercase characters A to Z

>>> b_chr = bytearray("ABCDEFGHIJKLMNOPQRSTUVWXYZ", "ascii")
>>> print(b_chr, len(b_chr))
bytearray(b'ABCDEFGHIJKLMNOPQRSTUVWXYZ') 26
>>> [c for c in b_chr]
[65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90]
>>> b_chr[0]
65
>>> [format(c, '02x') for c in b_chr]
['41', '42', '43', '44', '45', '46', '47', '48', '49', '4a', '4b', '4c', '4d', '4e', '4f', '50', '51', '52', '53', '54', '55', '56', '57', '58', '59', '5a']
>>> b_chr.decode("ascii")    # convert back from bytearray to ASCII string
'ABCDEFGHIJKLMNOPQRSTUVWXYZ'

We load an ASCII string into a byte array, convert the byte array into a hex string and convert the hex string back into an ASCII string.

>>> import binascii
>>> num_str = '0123456789'
>>> num_arr = bytearray(num_str, 'ascii')
>>> print(num_arr)
bytearray(b'0123456789')
>>> num_hex = binascii.hexlify(num_arr)
>>> print(num_hex)
b'30313233343536373839'
>>> binascii.unhexlify(num_hex)
b'0123456789'

The following python script ascii.py capitalizes lowercase characters in a byte array by unsetting bit 5 in the ASCII encoding and then converts all uppercase characters to lowercase by setting bit 5.

#!/usr/bin/python3
b_arr = bytearray("Oh my god! There are 20 students in the classroom, (\x07)", "ascii")
print(b_arr)

# Capitalize lower case characters by unsetting bit 5
for i in range(0, len(b_arr)):
    if b_arr[i] > 0x60 and b_arr[i] < 0x7b:
        b_arr[i] &= 0b11011111
print(b_arr)

# Convert all uppercase characters to lowercase by setting bit 5
for i in range(0, len(b_arr)):
    if b_arr[i] > 0x40 and b_arr[i] < 0x5b:
        b_arr[i] |= 0x20
print(b_arr)

The output of the script is

python3 ascii.py
bytearray(b'Oh my god! There are 20 students in the classroom, (\x07)')
bytearray(b'OH MY GOD! THERE ARE 20 STUDENTS IN THE CLASSROOM, (\x07)')
bytearray(b'oh my god! there are 20 students in the classroom, (\x07)')

As a comparison we show how conversion to uppercase or lowercase can be easily done using Python string functions.

>>> b_str = 'Oh my god! There are 20 students in the classroom, (\x07)'
>>> b_str.upper()
'OH MY GOD! THERE ARE 20 STUDENTS IN THE CLASSROOM, (\x07)'
>>> b_str.lower()
'oh my god! there are 20 students in the classroom, (\x07)'

UTF-8

The 8-bit Unicode Transformation Format UTF-8 is a variable width character encoding capable of encoding all 1,112,064 valid code points in the Unicode (Universal Coded Character Set) using one to four 8-bit bytes.

It was designed for backward compatibility with ASCII. Code points with lower numerical values, which tend to occur more frequently, are encoded using fewer bytes. The first 128 characters of Unicode, which correspond one-to-one with ASCII, are encoded using a single byte with the same binary value as ASCII, so that valid ASCII text is valid UTF-8-encoded Unicode as well. Since ASCII bytes do not occur when encoding non-ASCII code points into UTF-8, UTF-8 is safe to use within most programming and document languages that interpret certain ASCII characters in a special way, such as / (slash) in filenames, \ (backslash) in escape sequences, and % in printf.

Since the restriction of the Unicode code-space to 21-bit values in 2003, UTF-8 is defined to encode code points in one to four bytes, depending on the number of significant bits in the numerical value of the code point. The following table shows the structure of the encoding. The x characters are replaced by the bits of the code point. If the number of significant bits is no more than seven, the first line applies; if no more than 11 bits, the second line applies, and so on.

Number of Bytes	Bits for code point	First code point	Last code point	Byte 1	Byte 2	Byte 3	Byte 4
1	7	U+0000	U+007F	0xxxxxxx
2	11	U+0080	U+07FF	110xxxxx	10xxxxxx
3	16	U+0800	U+FFFF	1110xxxx	10xxxxxx	10xxxxxx
4	21	U+10000	U+10FFFF	11110xxx	10xxxxxx	10xxxxxx	10xxxxxx

The first 128 characters (US-ASCII) need one byte.
The next 1,920 characters need two bytes to encode, which covers the remainder of almost all Latin-script alphabets, and also Greek, Cyrillic, Coptic, Armenian, Hebrew, Arabic, Syriac, Thaana and N'Ko alphabets, as well as Combining Diacritical Marks.
Three bytes are needed for characters in the rest of the Basic Multilingual Plane, which contains virtually all characters in common use, including most Chinese, Japanese and Korean characters.
Four bytes are needed for characters in the other planes of Unicode, which include less common CJK characters, various historic scripts, mathematical symbols, and emoji (pictographic symbols).

Python 2: We analyze a some UTF-8 encoding examples.

The ord() and chr() functions can also be used with UTF-8 characters:

>>> ord('€')
8364
>>> chr(8364)
'€'

Next a very simple example containing the ASCII character a, the German character ä, the French character à and a latin small character a with a ring below ḁ:

>>> import binascii
>>> chr_str = u'aäàḁ'
>>> print(chr_str, len(chr_str))
aäàḁ 4
>>> [c for c in chr_str]
['a', 'ä', 'à', 'ḁ']
>>> [ord(c) for c in chr_str]
[97, 228, 224, 7681]
>>> [binascii.hexlify(bytearray(c, 'utf-8')) for c in chr_str]
[b'61', b'c3a4', b'c3a0', b'e1b881']
>>> chr_arr = bytearray(chr_str, 'utf-8')
>>> print(chr_arr, len(chr_arr))
bytearray(b'a\xc3\xa4\xc3\xa0\xe1\xb8\x81') 8
>>> [format(x, '08b') for x in chr_arr]
['01100001', '11000011', '10100100', '11000011', '10100000', '11100001', '10111000', '10000001']
>>> chr_hex = binascii.hexlify(chr_arr)
>>> print(chr_hex, len(chr_hex))
b'61c3a4c3a0e1b881' 16

Here is the UTF-8 encoding of a German pangram containing 35 different characters:

>>> import binascii
>>> chr_str = u'„Fix, Schwyz!“, quäkt Jürgen blöd vom Paß'
>>> print(chr_str, len(chr_str))
„Fix, Schwyz!“, quäkt Jürgen blöd vom Paß 41
>>> [c for c in chr_str]
['„', 'F', 'i', 'x', ',', ' ', 'S', 'c', 'h', 'w', 'y', 'z', '!', '“', ',', ' ', 'q', 'u', 'ä', 'k', 't', ' ', 'J', 'ü', 'r', 'g', 'e', 'n', ' ', 'b', 'l', 'ö', 'd', ' ', 'v', 'o', 'm', ' ', 'P', 'a', 'ß']
[binascii.hexlify(bytearray(c, 'utf-8')) for c in chr_str]
[b'e2809e', b'46', b'69', b'78', b'2c', b'20', b'53', b'63', b'68', b'77', b'79', b'7a', b'21', b'e2809c', b'2c', b'20', b'71', b'75', b'c3a4', b'6b', b'74', b'20', b'4a', b'c3bc', b'72', b'67', b'65', b'6e', b'20', b'62', b'6c', b'c3b6', b'64', b'20', b'76', b'6f', b'6d', b'20', b'50', b'61', b'c39f']
>>> chr_arr = bytearray(chr_str, 'utf-8')
>>> print(chr_arr, len(chr_arr))
bytearray(b'\xe2\x80\x9eFix, Schwyz!\xe2\x80\x9c, qu\xc3\xa4kt J\xc3\xbcrgen bl\xc3\xb6d vom Pa\xc3\x9f') 49
chr_hex = binascii.hexlify(chr_arr)
>>> print(chr_hex, len(chr_hex))
b'e2809e4669782c2053636877797a21e2809c2c207175c3a46b74204ac3bc7267656e20626cc3b66420766f6d205061c39f' 98

And here the UTF-8 encoding of a French pangram containing all 45 characters of the French language:

>> import binascii
>>> chr_str = u'Dès Noël, où un zéphyr haï me vêt de glaçons würmiens, je dîne d’exquis rôtis de bœuf au kir, à l’aÿ d’âge mûr, &cætera.'
>>> print(chr_str, len(chr_str))
Dès Noël, où un zéphyr haï me vêt de glaçons würmiens, je dîne d’exquis rôtis de bœuf au kir, à l’aÿ d’âge mûr, &cætera. 120
>>> [c for c in chr_str]
['D', 'è', 's', ' ', 'N', 'o', 'ë', 'l', ',', ' ', 'o', 'ù', ' ', 'u', 'n', ' ', 'z', 'é', 'p', 'h', 'y', 'r', ' ', 'h', 'a', 'ï', ' ', 'm', 'e', ' ', 'v', 'ê', 't', ' ', 'd', 'e', ' ', 'g', 'l', 'a', 'ç', 'o', 'n', 's', ' ', 'w', 'ü', 'r', 'm', 'i', 'e', 'n', 's', ',', ' ', 'j', 'e', ' ', 'd', 'î', 'n', 'e', ' ', 'd', '’', 'e', 'x', 'q', 'u', 'i', 's', ' ', 'r', 'ô', 't', 'i', 's', ' ', 'd', 'e', ' ', 'b', 'œ', 'u', 'f', ' ', 'a', 'u', ' ', 'k', 'i', 'r', ',', ' ', 'à', ' ', 'l', '’', 'a', 'ÿ', ' ', 'd', '’', 'â', 'g', 'e', ' ', 'm', 'û', 'r', ',', ' ', '&', 'c', 'æ', 't', 'e', 'r', 'a', '.']
[binascii.hexlify(bytearray(c, 'utf-8')) for c in chr_str]
[b'44', b'c3a8', b'73', b'20', b'4e', b'6f', b'c3ab', b'6c', b'2c', b'20', b'6f', b'c3b9', b'20', b'75', b'6e', b'20', b'7a', b'c3a9', b'70', b'68', b'79', b'72', b'20', b'68', b'61', b'c3af', b'20', b'6d', b'65', b'20', b'76', b'c3aa', b'74', b'20', b'64', b'65', b'20', b'67', b'6c', b'61', b'c3a7', b'6f', b'6e', b'73', b'20', b'77', b'c3bc', b'72', b'6d', b'69', b'65', b'6e', b'73', b'2c', b'20', b'6a', b'65', b'20', b'64', b'c3ae', b'6e', b'65', b'20', b'64', b'e28099', b'65', b'78', b'71', b'75', b'69', b'73', b'20', b'72', b'c3b4', b'74', b'69', b'73', b'20', b'64', b'65', b'20', b'62', b'c593', b'75', b'66', b'20', b'61', b'75', b'20', b'6b', b'69', b'72', b'2c', b'20', b'c3a0', b'20', b'6c', b'e28099', b'61', b'c3bf', b'20', b'64', b'e28099', b'c3a2', b'67', b'65', b'20', b'6d', b'c3bb', b'72', b'2c', b'20', b'26', b'63', b'c3a6', b'74', b'65', b'72', b'61', b'2e']
>>> chr_arr = bytearray(chr_str, 'utf-8')
>>> print(chr_arr, len(chr_arr))
bytearray(b'D\xc3\xa8s No\xc3\xabl, o\xc3\xb9 un z\xc3\xa9phyr ha\xc3\xaf me v\xc3\xaat de gla\xc3\xa7ons w\xc3\xbcrmiens, je d\xc3\xaene d\xe2\x80\x99exquis r\xc3\xb4tis de b\xc5\x93uf au kir, \xc3\xa0 l\xe2\x80\x99a\xc3\xbf d\xe2\x80\x99\xc3\xa2ge m\xc3\xbbr, &c\xc3\xa6tera.') 142
>>> chr_hex = binascii.hexlify(chr_arr)
>>> print(chr_hex, len(chr_hex))
b'44c3a873204e6fc3ab6c2c206fc3b920756e207ac3a970687972206861c3af206d652076c3aa7420646520676c61c3a76f6e732077c3bc726d69656e732c206a652064c3ae6e652064e280996578717569732072c3b47469732064652062c5937566206175206b69722c20c3a0206ce2809961c3bf2064e28099c3a26765206dc3bb722c202663c3a6746572612e' 284

Finally the UTF-8 encoding of a Russian pangram containing all 33 characters of the Russian language:

>>> import binascii
>>> chr_str = u'Эй, жлоб! Где туз? Прячь юных съёмщиц в шкаф'
>>> print(chr_str, len(chr_str))
Эй, жлоб! Где туз? Прячь юных съёмщиц в шкаф 44
>>> [binascii.hexlify(bytearray(c, 'utf-8')) for c in chr_str]
[b'd0ad', b'd0b9', b'2c', b'20', b'd0b6', b'd0bb', b'd0be', b'd0b1', b'21', b'20', b'd093', b'd0b4', b'd0b5', b'20', b'd182', b'd183', b'd0b7', b'3f', b'20', b'd09f', b'd180', b'd18f', b'd187', b'd18c', b'20', b'd18e', b'd0bd', b'd18b', b'd185', b'20', b'd181', b'd18a', b'd191', b'd0bc', b'd189', b'd0b8', b'd186', b'20', b'd0b2', b'20', b'd188', b'd0ba', b'd0b0', b'd184']
>>> chr_arr = bytearray(chr_str, 'utf-8')
>>> print(chr_arr, len(chr_arr))
bytearray(b'\xd0\xad\xd0\xb9, \xd0\xb6\xd0\xbb\xd0\xbe\xd0\xb1! \xd0\x93\xd0\xb4\xd0\xb5 \xd1\x82\xd1\x83\xd0\xb7? \xd0\x9f\xd1\x80\xd1\x8f\xd1\x87\xd1\x8c \xd1\x8e\xd0\xbd\xd1\x8b\xd1\x85 \xd1\x81\xd1\x8a\xd1\x91\xd0\xbc\xd1\x89\xd0\xb8\xd1\x86 \xd0\xb2 \xd1\x88\xd0\xba\xd0\xb0\xd1\x84') 77
>>> chr_hex = binascii.hexlify(chr_arr)
>>> print(chr_hex, len(chr_hex))
b'd0add0b92c20d0b6d0bbd0bed0b12120d093d0b4d0b520d182d183d0b73f20d09fd180d18fd187d18c20d18ed0bdd18bd18520d181d18ad191d0bcd189d0b8d18620d0b220d188d0bad0b0d184' 154

Base64

The Base64 scheme encodes N binary bytes into 4 * ⌈N/3⌉ printable ASCII characters and thus allows to carry data stored in binary formats across channels that only reliably support 7-bit US-ASCII text content. The encoding is shown below:

+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+
|     Binary Byte A     |     Binary Byte B     |     Binary Byte C     |
+--+--+--+---+---+---+--+--+--+--+--+---+---+---+--+--+--+--+--+--+--+--+
 A0 A1 A2 A3 A4 A5 A6 A7 B0 B1 B2 B3 B4 B5 B6 B7 C0 C1 C2 C3 C4 C5 C6 C7  Bits
+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+
|  Base64 Char 1  |  Base64 Char 2  |  Base64 Char 3  |  Base64 Char 4  |
+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+

The 24 bits A0..A7, B0..B7, C0..C7 of 3 consecutive bytes A, B and C are split into 4 groups of 6 bits each that are then mapped to the 64 printable US-ASCII characters A..Z, a..z, 0..9, + and / according to the table listed below.

I	Binary	Char	I	Binary	Char	I	Binary	Char	I	Binary	Char
0	000000	A	16	010000	Q	32	100000	g	48	110000	w
1	000001	B	17	010001	R	33	100001	h	49	110001	x
2	000010	C	18	010010	S	34	100010	i	50	110010	y
3	000011	D	19	010011	T	35	100011	j	51	110011	z
4	000100	E	20	010100	U	36	100100	k	52	110100	0
5	000101	F	21	010101	V	37	100101	l	53	110101	1
6	000110	G	22	010110	W	38	100110	m	54	110110	2
7	000111	H	23	010111	X	39	100111	n	55	110111	3
8	001000	I	24	011000	Y	40	101000	o	56	111000	4
9	001001	J	25	011001	Z	41	101001	p	57	111001	5
10	001010	K	26	011010	a	42	101010	q	58	111010	6
11	001011	L	27	011011	b	43	101011	r	59	111011	7
12	001100	M	28	011100	c	44	101100	s	60	111100	8
13	001101	N	29	011101	d	45	101101	t	61	111101	9
14	001110	O	30	011110	e	46	101110	u	62	111110	+
15	001111	P	31	011111	f	47	101111	v	63	111111	/

If the total number N of bytes contained in a binary blob is not an exact multiple of three, i.e. N mod 3 != 0 then one or two bytes remain at the end of the data block which have to be processed separately.:

For the case N mod 3 = 2, the remaining 16 bits A0..A7, B0..B7 are padded by two zero bits and the resulting 18 bits can then be mapped to 3 Base64 characters. In order to indicate the padding, an additional = ASCII character is appended, making the total number of Base64 characters a multiple of four.

+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+
|     Binary Byte A     |     Binary Byte B     |
+--+--+--+---+---+---+--+--+--+--+--+---+---+---+
 A0 A1 A2 A3 A4 A5 A6 A7 B0 B1 B2 B3 B4 B5 B6 B7  0  0                    Bits
+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+
|  Base64 Char 1  |  Base64 Char 2  |  Base64 Char 3  |        =        |
+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+

For the case N mod 3 = 1, the remaining 8 bits A0..A7 are padded by four zero bits and the resulting 12 bits can then be mapped to 2 Base64 characters. In order to indicate the padding, two additional = ASCII characters are appended, making the total number of Base64 characters a multiple of four.

+--+--+--+--+--+--+--+--+
|     Binary Byte A     |
+--+--+--+---+---+---+--+
 A0 A1 A2 A3 A4 A5 A6 A7 0  0  0  0                                       Bits
+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+
|  Base64 Char 1  |  Base64 Char 2  |        =        |        =        |
+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+

Python 3: Use the built-in Base64 encoding and decoding functions and compare it to Hex encoding.

Base64 encoding of 6 binary bytes:

>>> import base64, binascii
>>> data_bin = bytes([0x11, 0x22, 0x33, 0x44, 0x55, 0x66])
>>> [format(x, '08b') for x in data_bin]
['00010001', '00100010', '00110011', '01000100', '01010101', '01100110']
>>> data_hex = binascii.hexlify(data_bin)
>>> print(data_hex, len(data_hex))
b'112233445566' 12
>>> data_b64 = base64.b64encode(data_bin)
>>> print(data_b64, len(data_b64))
b'ESIzRFVm' 8
>>> [format(x, '08b') for x in base64.b64decode(data_b64)]
['00010001', '00100010', '00110011', '01000100', '01010101', '01100110']

Base64 encoding of 5 binary bytes requires padding:

>>> import base64, binascii
>>> data_bin = bytes([0x11, 0x22, 0x33, 0x44, 0x55])
>>> [format(x, '08b') for x in data_bin]
['00010001', '00100010', '00110011', '01000100', '01010101']
>>> data_hex = binascii.hexlify(data_bin)
>>> print(data_hex, len(data_hex))
b'1122334455' 10
>>> data_b64 = base64.b64encode(data_bin)
>>> print(data_b64, len(data_b64))
b'ESIzRFU=' 8
>>> [format(x, '08b') for x in base64.b64decode(data_b64)]
['00010001', '00100010', '00110011', '01000100', '01010101']

Base64 encoding of 4 binary bytes requires padding:

>>> import base64, binascii
>>> data_bin = bytes([0x11, 0x22, 0x33, 0x44])
>>> [format(x, '08b') for x in data_bin]
['00010001', '00100010', '00110011', '01000100']
>>> data_hex = binascii.hexlify(data_bin)
>>> print(data_hex, len(data_hex))
b'11223344' 8
>>> data_b64 = base64.b64encode(data_bin)
>>> print(data_b64, len(data_b64))
b'ESIzRA==' 8
>>> [format(x, '08b') for x in base64.b64decode(data_b64)]
['00010001', '00100010', '00110011', '01000100']

Author: Andreas Steffen CC BY 4.0

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Character_Sets.md

Character_Sets.md

Character Sets

Table of Contents

ASCII

UTF-8

Base64

Files

Character_Sets.md

Latest commit

History

Character_Sets.md

File metadata and controls

Character Sets

Table of Contents

ASCII

UTF-8

Base64