Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Request] can dwalk "stream" text output as it reads mfu file (to avoid high RAM usge)? #563

Open
markmoe19 opened this issue Oct 16, 2023 · 3 comments

Comments

@markmoe19
Copy link

I have a large 1.4TB .mfu file generated by dwalk for 502M items.
I want to generate an unsorted text output file from this mfu file.

Does dwalk read the entire mfu file into RAM before outputting the text file?
For sorted output, I could see reading into all into RAM. But for unsorted output, could dwalk “stream” the output as it reads the mfu input and thereby not use much ram?

I’m asking because I have a service node that can generate the generate the mfu file but doesn’t have enough ram to generate (unsorted) text output from that same mfu file.

Thanks!

  • Mark
@adammoody
Copy link
Member

adammoody commented Oct 16, 2023

Good question. At the moment dwalk and the other tools can only read the entire file at once. We'd have to hack together a tool for the streaming bit.

The code that reads the .mfu file (file format version 4) is here:

static void read_cache_v4(

The loop that unpacks an encoded file entry read from the .mfu file is here:

while (packcount < (uint64_t) read_count) {
/* unpack item from buffer and advance pointer */
list_insert_ptr(flist, ptr, 1, chars);
ptr += elem_size;
packcount++;
}

The biggest change is that we couldn't unpack the entries into an mfu_flist like this function does, since the flist structure expects to have the full list loaded in memory at once. However, one could look to modify the unpack function to just print the file name instead of inserting the element into the list.

The list_insert_ptr() function unpacks each element and adds it to the list:

static size_t list_insert_ptr(flist_t* flist, char* ptr, int detail, uint64_t chars)

Most of the heavy lifting in parsing the data for each file is in list_elem_unpack, which shows how the fields in the element are set:

static size_t list_elem_unpack(const void* buf, int detail, uint64_t chars, elem_t* elem)

You could perhaps just cut-paste-edit list_insert_ptr function to have a print_ptr version that allocates, unpacks, prints the file name, and frees the element, something like:

static size_t print_ptr(char* ptr, int detail, uint64_t chars)
{
    elem_t* elem = (elem_t*) MFU_MALLOC(sizeof(elem_t));
    size_t bytes = list_elem_unpack(ptr, detail, chars, elem);
    printf("%s\n", elem->file);
    mfu_free(&elem->file);
    mfu_free(&elem);
    return bytes;
}

It would be cleaner still to avoid allocating and freeing the element each time.

@adammoody
Copy link
Member

adammoody commented Oct 17, 2023

Actually, after reviewing the code, you may be able to read this file back on the same node using dwalk.

The current v4 of the .mfu format stores file names as fixed length fields, where every file name is padded to the longest file name in the set.

/* copy in file name */
char* file = elem->file;
strncpy(ptr, file, chars);
ptr += chars;

That decision makes it easy to seek to a specific entry in the .mfu file, but it also significantly inflates the .mfu file size if there are many files and one really long filename:

disk space ~ (numfiles * max(filename length))

When reading the file entries back from the .mfu file, we copy each file name using strdup:

/* copy path */
elem->file = MFU_STRDUP(file);

The strdup will drop all of that extra padding, so the same list when read back into memory will take less space than when stored on disk.

If you generated this list from a single node, you might be able to read it back on that single node.

With the eventual v5 .mfu file format, whenever that comes, I hope we can store file names using variable length structures to avoid this problem. That will likely require the addition of an index to support efficient seeks.

@adammoody
Copy link
Member

If you find that you can't use dwalk, let me know, and I can hack up a branch with a tool to get you started.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants