strutil
It provides with several string operation functions.
This library is considered production ready.
from pykit import strutil
lines = [
'hello',
'world',
]
# add left padding to each line in a string
strutil.line_pad('\n'.join(lines), ' ' * 4)
# " hello"
# " world"
# format a multi-row line
items = [ 'name:',
[ 'John',
'j is my nick'
],
'age:',
26,
'experience:',
[ '2000 THU',
'2006 sina',
'2010 other'
],
]
strutil.format_line(items, sep=' | ', aligns = 'llllll')
# outputs:
# name: | John | age: | 26 | experience: | 2000 THU
# | j is my nick | | | | 2006 sina
# | | | | | 2010 other
from pykit.strutil import blue
from pykit.strutil import green
blue("blue") + " and " + green("green")
The above snippet will output colored text on a terminal:
If you are going to use colored string as terminal prompt, the terminal prompt is not wrapping correctly with very long commands. You'd better tell ColoredString that is a prompt color string.
from pykit.strutil import ColoredString
prompt = ColoredString('colored prompt# ', color='red', prompt=True)
Those screenshots show this issue, the cursor is box.
prompt=False
long command:
prompt=False
long command after Home Key:
prompt=False
long command after End Key:
prompt=True
long command:
prompt=True
long command after Home Key:
prompt=True
long command after End Key:
syntax:
strutil.ColoredString(normal_string, color=None, prompt=True)
It provides the colored string in terminal on Unix.
arguments:
-
normal_string
: the string to colour. -
color
: the color of normal_string, named color are:blue
cyan
green
purple
red
white
yellow
optimal
normal
loaded
warn
danger
,you can colour string with integer [0-256].
return: An instance of strutil.ColoredString.
syntax:
ColoredString.join(iterable)
Return a ColoredString instance which is the concatenation of the elements in
iterable
, the separator between elements is the instance providing this method.
It does not change color of the element. String element has no color in the result.
arguments:
iterable
: an iterable whose elements are string values or ColoredString instances.
return: An instance of strutil.ColoredString.
syntax:
ColoredString.split([sep[, maxsplit]])
Return a list of the colored words in the ColoredString. If maxsplit
is given, at most
maxsplit
splits are done. If maxsplit
is not specified or -1, then there is
no limit on the number of splits.
If sep
is given, consecutive delimiters are not grouped together and are
deemed to delimit empty strings. The sep
argument may consist of multiple
characters. Splitting an empty ColoredString with a specified separator returns
[ColoredString('')]
.
If sep
is not specified or is None
, a diffrent splitting algorithm is
applied: runs of consecutive whitespace are regarded as a single separator, and
the result will contain no empty strings at the start or end if the string has
leading or trailing whitespace. Consequently, splitting an empty string or a
string consisting of just whitespace with a None
separator returns []
.
Works just like str.split
.
For example:
#[ColoredString('hello'), ColoredString('pykit')]
ColoredString(' hello pykit ').split()
#[ColoredString('hello', 'red'), ColoredString('pykit', 'blue')]
color = ColoredString('hello<', 'red') + ColoredString('>pykit', 'blue')
color.split('<>')
#[ColoredString('hel', 'red'), ColoredString('lo<', 'red')+ColoredString('>pykit', 'blue')]
color = ColoredString('hel<>lo<', 'red') + ColoredString('>pykit', 'blue')
color.split('<>', 1)
arguments:
-
sep
: is the delimiter to split the ColoredString. Optional. Ifsep
is not specified or is None, then runs of consecutive whitespace are regarded as a single separator. -
maxsplit
: is the number of splits. Optional. Ifmaxsplit
is not specified or -1, then there is no limit on the number of splits.
return: a list of the colored words.
syntax:
ColoredString.splitlines([keepend])
Return a list of the lines in the ColoredString, breaking at line boundaries.
This method recognizes "\r"
, "\n"
and "\r\n"
as line boundaries. Line
breaks are not included in the resulting list unless keepend
is given and true.
Works just like str.splitlines
.
For example:
#[ColoredString('hello'), ColoredString('pykit')]
ColoredString('hello\npykit').splitlines()
#[ColoredString('hello\n', 'red'), ColoredString('pykit\n', 'blue')]
color = ColoredString('hello\n', 'red') + ColoredString('pykit\n', 'blue')
color.splitlines(true)
arguments:
keepend
: is a bool. Ifkeepend
is given and true, then line breaks are included in the resulting list. Optional.
return: a list of the lines in the ColoredString, breaking at line boundaries.
syntax:
strutil.TrieNode()
TrieNode
is a sub class of dict
to represent a node in trie.
It has the same __init__
as dict
.
See strutil.make_trie
.
Total number of items with the prefix represented by this node. None
means
there might be more items with this prefix thus its number can not be decided
yet.
Trie branch key, a single char
An outstanding node is a node that there might be more following input string which has its corresponding prefix.
This attribute points to the last added child node.
If this node is an outstanding node to its parent node. When created, a node must be an outstanding node.
syntax:
strutil.Hex(data, byte_length)
Create a str
based instance that represents a fixed-length hex string,
to simplify hex and arithmetic operations.
An Hex
instance supports arithmetic operations: + - * / % **
.
NOTE: it overrides native str
operation such as str + str
.
arguments:
-
data
: can be astr
,int
or tuple in form of(<prefix_hex>, <filling_byte>)
-
byte_length
: specifies number of bytes for this hex. It can not be changed after creating it.byte length x 2 = hex length
It also can be a symblic name:
crc32
,md5
,sha1
orsha256
.
Synopsis:
from pykit.strutil import Hex
# Different ways to create a 4-byte crc32 hex str
Hex(0x0102, 4) # 00000102
Hex(0x0102, 'crc32') # 00000102
Hex.crc32(0x0102) # 00000102
Hex('00000102', 'crc32') # 00000102
Hex.crc32('00000102') # 00000102
# Create with a tuple of prefix and a filling byte
Hex(('12', 1), 'crc32') # 12010101
# Arithmetic operations
c = Hex(0x0102, 'crc32')
d = c + 1
print type(d), d # <class 'pykit.strutil.hex.Hex'> 00000103
print repr(c*2) # '00000204'
print c - 1000000 # 00000000 # overflow protection
print c * 1000000 # ffffffff # overflow protection
# Iterate over sha1 space with a specific step:
c = Hex.sha1(0)
step = Hex.sha1(('10', 0))
for i in range(16):
print c
c += step
# 0000000000000000000000000000000000000000
# 1000000000000000000000000000000000000000
# 2000000000000000000000000000000000000000
# 3000000000000000000000000000000000000000
# 4000000000000000000000000000000000000000
# 5000000000000000000000000000000000000000
# 6000000000000000000000000000000000000000
# 7000000000000000000000000000000000000000
# 8000000000000000000000000000000000000000
# 9000000000000000000000000000000000000000
# a000000000000000000000000000000000000000
# b000000000000000000000000000000000000000
# c000000000000000000000000000000000000000
# d000000000000000000000000000000000000000
# e000000000000000000000000000000000000000
# f000000000000000000000000000000000000000
Hex.hex
: a plain string of hex'00000102'
.Hex.bytes
: a plain string of bytes'\0\0\1\2'
.Hex.int
: a int value0x0102
same asint('00000102', 16)
.Hex.byte_length
: number of bytes inHex.bytes
.
syntax:
strutil.break_line(linestr, width)
Split linestr
to lines by one space or line break
to make length of every line no greater than width
.
Only one space or line break is replaced at a time. Any others stay.
If linestr
is a ColoredString instance, color of non-blank chars do not change.
Examples:
strutil.break_line('foo bar bar.', 9)
#['foo bar', 'bar.']
print strutil.break_line('one two three', 4)
#['one ', ' two', '', 'three']
strutil.break_line(ColoredString('foo bar ', 'blue') + ColoredString('bar.', 'red'), 9)
#[ColoredString('foo', 'blue') + ColoredString(' ') + ColoredString('bar', 'blue'), ColoredString('bar.', 'red')]
arguments:
-
linestr
: is a string or a ColoredString instance. -
width
: is the longest line length expected after being split. Ifwidth
is negative, get the same result aswidth
is 0. And ifwidth
is a float, just integer part is used.
return:
A list filled with lines of split linestr
.
syntax:
strutil.<color>(str)
Create colored string to use in terminal.
t = strutil.blue("blue-text")
Supported operation on colored string t
:
# concatenate with other colored string:
t + strutil.green("green-text")
# concatenate with plain string:
t + "a"
# repeat:
t * 3
# length:
len(t)
Supported color names:
blue
cyan
green
yellow
red
purple
white
optimal
same asblue
normal
no colorloaded
same asgreen
warn
same asdark yellow
danger
same asred
syntax:
strutil.colorize(percent, total=100, ptn='{0}')
Colorize a percentage number.
Synopsis:
from pykit.strutil import colorize
# the color of p printed: blue -> green -> yellow -> red
for p in xrange(0, 100):
print colorize(p, 100),
print
# the color of p printed: red -> yellow -> green -> blue
for p in xrange(0, 100):
print colorize(p, -100),
print
# the color of p printed is red if p>=10
for p in xrange(0, 100):
print colorize(p, 10),
print
# the color of p printed is blue if p>=10
for p in xrange(0, 100):
print colorize(p, -10),
print
# 'the percent is: 100' with red
print colorize(100, 100, 'the percent is: {0}')
# ' 22%' with green
print colorize(22, 100, '{0:>3}%')
arguments:
-
percent
: the percent to colour. -
total
: the limitation of percent to colour. negative integer means to reverse. -
ptn
: the format of percent.
return: A colored formatted percent string.
Find common prefix of several string
s, tuples of string, or other nested
structure, recursively by default.
It returns the shortest prefix: empty string or empty tuple is removed.
Synopsis:
from pykit import strutil
strutil.common_prefix('abc', 'abd') # 'ab'
strutil.common_prefix((1, 2, 'abc'), (1, 2, 'abd')) # (1, 2, 'ab')
strutil.common_prefix((1, 2, 'abc'), (1, 2, 'xyz')) # (1, 2); empty prefix of 'abc' and 'xyz' is removed
strutil.common_prefix((1, 2, (5, 6)), (1, 2, (5, 7))) # (1, 2, (5,) )
strutil.common_prefix('abc', 'abd', 'abe') # 'ab'; common prefix of more than two
strutil.common_prefix((1, 2, 'abc'), (1, 2, 'abd'), recursive=False) # (1, 2)
syntax:
strutil.common_prefix(a, *others, **options)
arguments:
a
and element inothers
: arestring
,tuple
orlist
to find common prefix of them.- if field
recursive
inoptions
is set toFalse
, it will run non-recursively.
return:
a common prefix of the same type of a
.
syntax:
strutil.line_pad(linestr, padding)
arguments:
-
linestr
: multiple line string with\n
as line separator. -
padding
: left padding string to add before each line.It could also be a callable object that returns a string. This is useful when creating dynamic padding.
return:
multiple line string with \n
as line separator, with left padding added.
syntax:
strutil.make_trie(sorted_iterable, node_max_num=1)
Make a trie from a series of strings.
It also tries to squash (at most node_max_num
) leaf nodes to an ancestor to
reduce memory usage.
Thus when using it to calculate strings distribution, it achieves accuracy in
the unit of node_max_num
.
Building a trie is in O(n)
time cost, where n
is total numbers of chars in
the input.
Synopsis:
from pykit import strutil
t = strutil.make_trie(
(
'abc',
'abcdef',
'abcdeg',
'abcdfg',
'abd',
'abe',
'axy',
'b11',
'b12',
'b123',
'b14',
), node_max_num=3)
print str(t)
# 11: a,7: b,6: c,4: d,3
# d,1
# e,1
# x,1
# b,4: 1,4: 1,1
# 2,2
# 4,1
print str(t['a'])
# a,7: b,6: c,4: d,3
# d,1
# e,1
# x,1
arguments:
-
sorted_iterable
: is an iterable(list
,tuple
,generator
or else) of strings or other comparable elements. Sample of validsorted_iterable
:sorted_iterable = ['abc', 'abd', ] sorted_iterable = ('abc', 'abd', ) sorted_iterable = [(1, 'x'), (1, 'y'), ] def gen(): for i in range(3): yield (i,) sorted_iterable = gen()
Elements in it must be sorted ascending or descending.
-
node_max_num
: specifies the maximum number of strings a leaf node can have. If total number of all the leaf nodes of a parent node smaller or equalnode_max_num
, it squash leaf nodes to their parent.
return:
a TrieNode
instance, which is the root of a trie.
syntax:
strutil.format_line(elts, sep=' ', aligns='')
It formats a list in a multi row manner.
It is compatible with colored string such as those created with strutil.blue("blue-text")
.
strutil.format_line([1, "a:", "b"], sep=" | ")
# "1 | a: | b"
strutil.format_line([["name:", "age:"], ["drdrxp", "18"], "wow"], sep=" | ", aligns="rll")
# "name: | drdrxp | wow"
# " age: | 18 |"
arguments:
-
elts
: elements in a line. Each element could be astring
or alist
ofstring
. If it is alist
ofstring
, it would be rendered as a multi-row element. -
sep
: specifies the separator between each element in a line. By default it is a single space" "
. -
aligns
: specifies alignment for each element.l
for left-align.r
for right-align.
If no alignment specified for i-th element, it will be aligned to right by default.
return: formatted string.
syntax:
strutil.page(lines, max_lines=10, control_char=True, pager=('less',))
Display lines
of string in console, with a pager program (less
) if too many
lines.
It could be used in a interactive tool to display large content.
It output strings directly to stdout.
arguments:
-
lines
: islist
of lines to display. -
max_lines
: specifies the max lines not to use a pager.By default it is 10 lines.
-
control_char
: specifies if to interpret controlling chars, such as color char in terminal. -
pager
: specifies the program as a pager.It is a list of command and argument.
By default it is
('less',)
.
return: Nothing.
syntax:
strutil.sharding(sorted_iterable, size, accuracy=None, joiner=''.join)
Split sorted_iterable
into segments, each one of those contains
size
to size + accuracy
strings, except the last one.
Time complexity is O(n)
, where n
is the total number of chars in all strings.
It uses a trie as underlying storage to track string distribution.
See strutil.make_trie
.
It returns a list of tuples contains start key and segment size.
-
The start key is a prefix of one of
sorted_iterable
. Ifsorted_iterable
is a list of strings, start key is a prefix string. Ifjoiner
is specified, start key is constructed withjoiner
. -
A start key are always shortened to the minimal size.
-
The first start key is always
None
. -
The smaller
accuracy
is, the more memory is cost, because it need to track more distribution information during scanning the strings.accuracy=1
let it track every single char in all strings.
from pykit import strutil
with open('words.txt') as f:
lines=f.readlines()
lines=[x.strip() for x in lines]
print strutil.sharding(lines, size=200, accuracy=20)
# [
# (None , 209, ),
# ('M' , 202, ),
# ('TestU' , 202, ),
# ('br' , 202, ),
# ('dc' , 201, ),
# ('exi' , 202, ),
# ('inf' , 204, ),
# ('may' , 205, ),
# ('pf' , 200, ),
# ('rew' , 208, ),
# ('suc' , 204, ),
# ('wh' , 56, ),
# ]
arguments:
-
sorted_iterable
: an iterable of strings, it can be alist
,tuple
orgenerator
. All strings in it must be sorted. -
size
: specifies the expected minimal segment size. -
accuracy
: specifies the acceptable variance of segment size. The result segment size is betweensize
andsize + accuracy
.By default it is
size / 10
-
joiner
: specifies what function to use to reconstructe a sharding key from trie-node-path.By default it reconstruct a string from a list of chars.
To generate
tuple
sharding keys: usestrutil.sharding(.., joiner=tuple)
.
return: a list of tuple.
syntax:
strutil.struct_repr(d, key=None)
Render primitive or composite data to a structural representation string list.
a = {
1: 3,
'x': {1:4, 2:5},
'l': [1, 2, 3],
}
for l in strutil.struct_repr(a):
print l
# 1 : 3
# l : - 1
# - 2
# - 3
# x : 1 : 4
# 2 : 5
arguments:
-
d
: a number, string, list or dict to render to a structural representation. -
key
: is a callable that is used to sort dict keys.It is used in sort:
keys.sort(key=key)
.
return: a list of string.
syntax:
strutil.format_table(rows, keys=None, colors=None, row_sep=None, sep=' | ')
Render a list of data into a table.
Number of rows is len(rows)
.
Number of columns is len(rows[0])
.
inp = [
{'acl': {},
'bucket': 'game1.read',
'bucket_id': '1400000000000689036',
'num_used': '0',
'owner': 'game1',
'space_used': '0',
'ts': '1492091893065708032'},
{'acl': {},
'bucket': 'game2.read',
'bucket_id': '1510000000000689037',
'num_used': '0',
'owner': 'game2',
'space_used': '0',
'ts': '1492091906629786880'},
{'acl': {'imgx': ['READ', 'READ_ACP', 'WRITE', 'WRITE_ACP']},
'bucket': 'imgx-test',
'bucket_id': '1910000000000689048',
'num_used': '0',
'owner': 'imgx',
'space_used': '0',
'ts': '1492101189213795840'}]
for l in strutil.format_table(inp, row_sep='-'):
print l
# acl: | bucket: | bucket_id: | num_used: | owner: | space_used: | ts:
# -----------------------------------------------------------------------------------------------------------------
# {} | game1.read | 1400000000000689036 | 0 | game1 | 0 | 1492091893065708032
# -----------------------------------------------------------------------------------------------------------------
# {} | game2.read | 1510000000000689037 | 0 | game2 | 0 | 1492091906629786880
# -----------------------------------------------------------------------------------------------------------------
# imgx : - READ | imgx-test | 1910000000000689048 | 0 | imgx | 0 | 1492101189213795840
# - READ_ACP | | | | | |
# - WRITE | | | | | |
# - WRITE_ACP | | | | | |
# customize column header:
for l in strutil.format_table(inp, keys=[['bucket', 'Bkt'],
['num_used', 'n']]):
print l
# Bkt: | n:
# game1.read | 0
# game2.read | 0
# imgx-test | 0
arguments:
-
rows
: list of items to render.Element of list can be number, string, list or dict.
-
keys
: specifies indexes(for list) or keys(for dict) to render. It is a list.Indexes or keys those are not in this list will not be rendered.
It can also be used to specify customized column headers, if element in list is a 2-element tuple or list:
keys = [ ('bucket', 'Bkt'), ('num_used', 'n'), 'bucket_id', ] # It output a talbe like below: # Bkt: | n: | bucket_id: # game1.read | 0 | ...
-
colors
: specifies the color for each column. It is a list of color values in number or color name strings.If length of
colors
is smaller than the number of columns(the number of indexes of a list, or keys of a dict), the colors are repeated for columns after. -
row_sep
: specifies char to separate rows.By default it is None, it means do not add line separator.
-
sep
: specifies column separator char.By default it is
" | "
.
return: a list of string.
syntax:
strutil.filter_invisible_chars(data)
Filters invisible characters in a string or a unicode object
from pykit.strutil import filter_invisible_chars
cases = [
"1273883926293937729\000\001\031",
"\x00\x01\x02\x03\x04\005",
u"1122299299299299292",
u"\x00\x01\x02\x03\x04\005",
]
rst = []
for case in cases:
rst.append(strutil.filter_invisible_chars(case))
for r in rst:
print r
#'1273883926293937729'
#''
#u'1122299299299299292'
#u''
arguments:
data
: a string or unicode object to filter invisible characters
return: a filtered string or unicode object
syntax:
strutil.tokenize(linestr, sep=None, quote='"\'', preserve=False)
Tokenize a line.
Synopsis:
from pykit.strutil import tokenize
# ['ab']
print tokenize('ab')
# ['a', 'b']
print tokenize('a b')
# ['a', 'b']
print tokenize(' a b ')
# ['a', 'b']
print tokenize(' a\t b\n c\r ')
# ['a b', 'c d']
print tokenize('a bxyc d', sep='xy')
# ['a', 'x x', 'b']
print tokenize('a "x x" b')
# ['a', 'x x', 'b']
print tokenize("a 'x x' b 'x") # the last `'x` has no pair, discard
# ['a', 'a b', 'c d']
print tokenize(' a xa bx yc dy ', quote='xy')
# ['a', 'xa bx', 'yc dy']
print tokenize('a xa bx yc dy', quote='xy', preserve=True)
# ['', 'a', 'xa bx', 'yc dy', '']
print tokenize(' a xa bx yc dy ', sep=' ', quote='xy', preserve=True)
arguments:
-
linestr
: the line to tokenize. -
sep
: is None or a non-empty string separator to tokenize with. If sep is None, runs of consecutive whitespace are regarded as a single separator, and the result will contain no empty strings at the start or end if the string has leading or trailing whitespace. Consequently, splitting an empty string or a string consisting of just whitespace with a None separator returns[]
. Just likestr.split(None)
. By default,sep
is None. -
quote
: Every character inquote
is regarded as a quote. Add a\
prefix to make an exception. Segment between the same quotes is preserved. By default,quote
is'"\''
. -
preserve
: preserve the quote itself ifpreserve
isTrue
. By default,preserve
isFalse
.
return: a list of string.
syntax:
strutil.parse_colon_kvs(data)
Convert a string constant to a dict type of data
from pykit.strutil import parse_colon_kvs
print parse_colon_kvs("abc:123")
# {'abc': '123'}
# For repeated keys, using the back value
print parse_colon_kvs("abc:123 abc:456")
# {'abc': '456'}
print parse_colon_kvs("")
# {}
arguments:
data
: A string constant, in the format "k:v", is used to convert to data of dict type
return: the data of dict type
Zhang Yanpo (张炎泼) [email protected]
The MIT License (MIT)
Copyright (c) 2015 Zhang Yanpo (张炎泼) [email protected]