It would be good to give some idea of how long the file parsing is going to take, e.g. when filtering the huge wikidata JSON file. I think this does it (although I'm not sure whether `enumerate(iter(f.readline, ''))` is significantly less efficient than `enumerate(f)`).
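For what it's worth, a quick way to check that worry would be a micro-benchmark along these lines (a minimal sketch; `test.txt` stands in for any large local text file):

```python
# Rough timing sketch comparing the two iteration styles on a plain
# text file; "test.txt" is just a stand-in for a large local file.
import timeit

def via_readline(path):
    with open(path, encoding="utf-8") as f:
        for _ in enumerate(iter(f.readline, "")):
            pass

def via_iterator(path):
    with open(path, encoding="utf-8") as f:
        for _ in enumerate(f):
            pass

for func in (via_readline, via_iterator):
    # Bind func as a default argument so each lambda times the right variant.
    seconds = timeit.timeit(lambda f=func: f("test.txt"), number=5)
    print(f"{func.__name__}: {seconds:.3f}s for 5 passes")
```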
I restructured the `enumerate_lines_from_file` function to handle the "print every X lines" behaviour itself: to prepend the percentage done, it needs access to the `.tell()` method of the underlying (compressed) byte stream, which is most easily done inside the generator function.
```python
import io
import os
import bz2
import gzip

def open_file_based_on_extension(filename, mode):
    # Open a file, whether it's uncompressed, bz2 or gz
    if filename.endswith(".bz2"):
        return bz2.open(filename, mode, encoding="utf-8")
    elif filename.endswith(".gz"):
        return gzip.open(filename, mode, encoding="utf-8")
    else:
        return open(filename, mode, encoding="utf-8")

def enumerate_lines_from_file(filename, print_every=None, print_line_num_func=None):
    """Enumerate the lines in a file, whether it's uncompressed, bz2 or gz.

    If print_every is given as an integer, print a message out every
    print_every lines. If print_line_num_func is given, it should be a
    function that takes in the line number and returns the string to print out.
    """
    underlying_file_size = os.path.getsize(filename)
    with open_file_based_on_extension(filename, "rt") as f:
        if print_every is not None:
            # Find the underlying byte stream, so that .tell() reports how
            # many compressed bytes have been read so far.
            try:
                underlying_file = f.buffer.fileobj  # gzip
            except AttributeError:
                try:
                    underlying_file = f.buffer._buffer.raw._fp  # bz2
                except AttributeError:
                    underlying_file = f  # plain
        # iter(f.readline, '') rather than iterating f directly, so that
        # f.tell() stays usable for the plain-text fallback above.
        for line_num, line in enumerate(iter(f.readline, '')):
            if print_every is not None and line_num != 0 and line_num % print_every == 0:
                underlying_file_pos = underlying_file.tell()
                percent_done = 100 * underlying_file_pos / underlying_file_size
                if print_line_num_func is not None:
                    print(f"{percent_done:.2f}% read. " + print_line_num_func(line_num))
                else:
                    print(f"{percent_done:.2f}% read. " + f"Processing line {line_num}")
            yield line

# Test plain
for line in enumerate_lines_from_file("test.txt", 1000):
    pass

# Test gzip
for line in enumerate_lines_from_file("test.gz", 1000):
    pass

# Test bz2 (also with a bespoke print function)
count = 0
for line in enumerate_lines_from_file("test.bz2", 1000,
                                      lambda n: f"{count} out of {n} lines start with A"):
    if line.startswith("A"):
        count += 1
```
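As a side note on the `.tell()` plumbing: `f.buffer.fileobj` and `f.buffer._buffer.raw._fp` are private implementation details of `gzip` and `bz2` and could change between Python versions. A sketch of an alternative that keeps an explicit reference to the raw byte stream instead (the name `open_with_raw_handle` is illustrative, not part of this change):

```python
# Sketch: open the raw byte stream ourselves, keep a handle to it for
# .tell(), and layer the decompressor and text decoding on top, rather
# than digging the raw file back out through private attributes.
import bz2
import gzip
import io

def open_with_raw_handle(filename):
    """Return (text_file, raw_file); raw_file.tell() reports compressed
    bytes read, for progress estimation against os.path.getsize()."""
    raw = open(filename, "rb")
    if filename.endswith(".bz2"):
        binary = bz2.BZ2File(raw)
    elif filename.endswith(".gz"):
        binary = gzip.GzipFile(fileobj=raw)
    else:
        binary = raw
    return io.TextIOWrapper(binary, encoding="utf-8"), raw
```

Inside the generator you would then compute `100 * raw.tell() / os.path.getsize(filename)` exactly as above. One caveat: when you pass your own file object in, `gzip` and `bz2` deliberately leave it open on close, so the caller has to close `raw` as well as the text wrapper.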