Progress when filtering large files #66

Open
hyanwong opened this issue Jun 30, 2024 · 2 comments
Comments

@hyanwong
Member

hyanwong commented Jun 30, 2024

It would be good to give some idea of how long the file parsing is going to take, e.g. when filtering the huge Wikidata JSON file. I think the code below does this (although I'm not sure whether enumerate(iter(f.readline, '')) is significantly less efficient than enumerate(f)).
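One way to settle that efficiency question is a quick micro-benchmark; a sketch, assuming a throwaway test file (the file contents and sizes here are placeholders, not from the project):

```python
import os
import tempfile
import timeit

# Write a throwaway test file (an assumption: any large text file would do)
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tmp:
    tmp.writelines(f"line {i}\n" for i in range(50_000))
    path = tmp.name

def count_with_enumerate_f():
    # Idiomatic iteration straight over the file object
    with open(path, encoding="utf-8") as f:
        return sum(1 for _ in enumerate(f))

def count_with_readline_iter():
    # Two-argument iter(): call f.readline until it returns ""
    with open(path, encoding="utf-8") as f:
        return sum(1 for _ in enumerate(iter(f.readline, "")))

n_direct = count_with_enumerate_f()
n_readline = count_with_readline_iter()
print("enumerate(f):        ", timeit.timeit(count_with_enumerate_f, number=3))
print("iter(f.readline, ''):", timeit.timeit(count_with_readline_iter, number=3))
os.unlink(path)
```

Both styles see exactly the same lines, so the timeit numbers are directly comparable.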

I restructured the enumerate_lines_from_file function to do the "print every X lines" logic itself: to prepend the amount done, it needs access to the .tell() method of the underlying (compressed) byte stream, which is most easily done inside the generator function.

import os
import bz2
import gzip

def open_file_based_on_extension(filename, mode):
    # Open a file, whether it's uncompressed, bz2 or gz
    if filename.endswith(".bz2"):
        return bz2.open(filename, mode, encoding="utf-8")
    elif filename.endswith(".gz"):
        return gzip.open(filename, mode, encoding="utf-8")
    else:
        return open(filename, mode, encoding="utf-8")


def enumerate_lines_from_file(filename, print_every=None, print_line_num_func=None):
    """
    Enumerate the lines in a file, whether it's uncompressed, bz2 or gz. If print_every
    is given as an integer, print a message out every print_every lines. If
    print_lin_num_func is given, it should be a function that takes in the line number
    and returns the string to print out.
    """
    underlying_file_size = os.path.getsize(filename)
    with open_file_based_on_extension(filename, "rt") as f:
        if print_every is not None:
            try:
                # gzip: the GzipFile is f.buffer; its .fileobj is the raw file
                underlying_file = f.buffer.fileobj
            except AttributeError:
                try:
                    # bz2: reach the raw file via private attributes
                    # (may break between Python versions)
                    underlying_file = f.buffer._buffer.raw._fp
                except AttributeError:
                    underlying_file = f  # plain, uncompressed file
        for line_num, line in enumerate(iter(f.readline, '')):
            if print_every is not None and line_num != 0 and line_num % print_every == 0:
                underlying_file_pos = underlying_file.tell()
                percent_done = 100 * underlying_file_pos / underlying_file_size
                if print_line_num_func is not None:
                    print(f"{percent_done:.2f}% read. " + print_line_num_func(line_num))
                else:
                    print(f"{percent_done:.2f}% read. Processing line {line_num}")
            yield line

# Test plain
for line in enumerate_lines_from_file("test.txt", 1000):
    pass

# Test gzip
for line in enumerate_lines_from_file("test.gz", 1000):
    pass

# Test bz2 (also with bespoke print function)
count = 0
for line in enumerate_lines_from_file("test.bz2", 1000, lambda n: f"{count} out of {n} lines start with A"):
    if line.startswith("A"):
        count += 1
@davidebbo
Collaborator

Ah, brilliant, thanks! I can integrate that back into the code base when I get back next week.

@davidebbo
Collaborator

We could also predict the remaining time and ETA quite accurately.
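A minimal sketch of that estimate, assuming a roughly constant read rate (eta_seconds and its arguments are hypothetical names, not part of the code above):

```python
import time

def eta_seconds(start_time, fraction_done, now=None):
    # fraction_done would be underlying_file_pos / underlying_file_size
    if fraction_done <= 0:
        return float("inf")  # no progress yet, so no estimate possible
    elapsed = (now if now is not None else time.monotonic()) - start_time
    # If fraction_done took `elapsed` seconds, the remainder scales proportionally
    return elapsed * (1 - fraction_done) / fraction_done
```

For example, 10 seconds in at 25% done, eta_seconds(0, 0.25, now=10) gives 30.0, i.e. about 30 seconds remaining.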
