Progress when filtering large files #66

Open
hyanwong opened this issue Jun 30, 2024 · 2 comments
Comments

@hyanwong
Member

hyanwong commented Jun 30, 2024

It would be good to give some idea of how long the file parsing is going to take, e.g. when filtering the huge Wikidata JSON file. I think the code below does this (although I'm not sure whether enumerate(iter(f.readline, '')) is significantly less efficient than enumerate(f)).
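One way to settle that efficiency question is a quick micro-benchmark; a sketch, assuming a throwaway test file (the file contents and sizes here are placeholders, not from the project):

```python
import os
import tempfile
import timeit

# Write a throwaway test file (an assumption: any large text file would do)
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tmp:
    tmp.writelines(f"line {i}\n" for i in range(50_000))
    path = tmp.name

def count_with_enumerate_f():
    # Idiomatic iteration straight over the file object
    with open(path, encoding="utf-8") as f:
        return sum(1 for _ in enumerate(f))

def count_with_readline_iter():
    # Two-argument iter(): call f.readline until it returns ""
    with open(path, encoding="utf-8") as f:
        return sum(1 for _ in enumerate(iter(f.readline, "")))

n_direct = count_with_enumerate_f()
n_readline = count_with_readline_iter()
print("enumerate(f):        ", timeit.timeit(count_with_enumerate_f, number=3))
print("iter(f.readline, ''):", timeit.timeit(count_with_readline_iter, number=3))
os.unlink(path)
```

Both styles see exactly the same lines, so the timeit numbers are directly comparable.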

I restructured the enumerate_lines_from_file function to do the "print every X lines" logic itself: to prepend the amount done, it needs access to the .tell() method of the underlying (compressed) byte stream, which is most easily done inside the generator function.

import os
import bz2
import gzip

def open_file_based_on_extension(filename, mode):
    # Open a file, whether it's uncompressed, bz2 or gz
    if filename.endswith(".bz2"):
        return bz2.open(filename, mode, encoding="utf-8")
    elif filename.endswith(".gz"):
        return gzip.open(filename, mode, encoding="utf-8")
    else:
        return open(filename, mode, encoding="utf-8")


def enumerate_lines_from_file(filename, print_every=None, print_line_num_func=None):
    """
    Enumerate the lines in a file, whether it's uncompressed, bz2 or gz. If print_every
    is given as an integer, print a message out every print_every lines. If
    print_lin_num_func is given, it should be a function that takes in the line number
    and returns the string to print out.
    """
    underlying_file_size = os.path.getsize(filename)
    with open_file_based_on_extension(filename, "rt") as f:
        if print_every is not None:
            try:
                # gzip: the GzipFile is f.buffer; its .fileobj is the raw file
                underlying_file = f.buffer.fileobj
            except AttributeError:
                try:
                    # bz2: reach the raw file via private attributes
                    # (may break between Python versions)
                    underlying_file = f.buffer._buffer.raw._fp
                except AttributeError:
                    underlying_file = f  # plain, uncompressed file
        for line_num, line in enumerate(iter(f.readline, '')):
            if print_every is not None and line_num != 0 and line_num % print_every == 0:
                underlying_file_pos = underlying_file.tell()
                percent_done = 100 * underlying_file_pos / underlying_file_size
                if print_line_num_func is not None:
                    print(f"{percent_done:.2f}% read. " + print_line_num_func(line_num))
                else:
                    print(f"{percent_done:.2f}% read. Processing line {line_num}")
            yield line

# Test plain
for line in enumerate_lines_from_file("test.txt", 1000):
    pass

# Test gzip
for line in enumerate_lines_from_file("test.gz", 1000):
    pass

# Test bz2 (also with bespoke print function)
count = 0
for line in enumerate_lines_from_file("test.bz2", 1000, lambda n: f"{count} out of {n} lines start with A"):
    if line.startswith("A"):
        count += 1
@davidebbo
Collaborator

Ah, brilliant, thanks! I can integrate that back into the code base when I get back next week.

@davidebbo
Collaborator

We could also predict the remaining time and ETA quite accurately.
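A minimal sketch of that estimate, assuming a roughly constant read rate (eta_seconds and its arguments are hypothetical names, not part of the code above):

```python
import time

def eta_seconds(start_time, fraction_done, now=None):
    # fraction_done would be underlying_file_pos / underlying_file_size
    if fraction_done <= 0:
        return float("inf")  # no progress yet, so no estimate possible
    elapsed = (now if now is not None else time.monotonic()) - start_time
    # If fraction_done took `elapsed` seconds, the remainder scales proportionally
    return elapsed * (1 - fraction_done) / fraction_done
```

For example, 10 seconds in at 25% done, eta_seconds(0, 0.25, now=10) gives 30.0, i.e. about 30 seconds remaining.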
