Add Support for Transient Files #79

Open · ric-evans opened this issue Feb 26, 2021 · 30 comments
Labels: data fix (modify existing persisted data) · enhancement · significant dev (this'll take some time)

Comments

@ric-evans (Member) commented Feb 26, 2021

Off the bat, this might mean eliminating the "logical_name" field.

Current Relevant Scenarios:

  1. When an actual file is moved, the corresponding FC record's "locations" object is manually updated. However, the "logical_name" retains the original path.
  2. When an actual file is deleted, the FC record is not deleted. Should it be?

Proposal:

  1. Remove the "logical_name" field. This is at best redundant, and at worst a red herring.
  2. Add an "active" field/flag to each "locations" object-entry: "active": true indicates the filepath is still valid.
  3. Add a service to regularly check up on FC records (see the sketch below)
    • This could either be an active service running on its own server;
    • or a passive service tied into the FC REST server that updates FC records only when a filepath is requested (requires access to lustre)
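
A minimal sketch of the passive variant, assuming a pymongo handle to the files collection and the per-location "active"/"verified-timestamp" fields proposed above (the collection name and helper are illustrative, not the FC's actual API):

```python
import os
import time

from pymongo import MongoClient

files = MongoClient()["file_catalog"]["files"]  # hypothetical collection handle

def verify_location(record_id, path):
    """Re-check one filepath and record the result on its "locations" entry."""
    files.update_one(
        {"_id": record_id, "locations.path": path},
        {"$set": {
            "locations.$.active": os.path.exists(path),  # requires lustre access
            "locations.$.verified-timestamp": time.time(),
        }},
    )
```
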
ric-evans added the enhancement and data fix labels Feb 26, 2021
@dsschult (Collaborator) commented:

We've gone back and forth about whether to keep deleted file records. I think there are three options:

  1. Keep everything forever
  2. Delete locations, but keep the metadata
  3. Delete the record if there are no remaining locations

I am currently in favor of 2 or 3. I'd lean towards 3 right now, as it's hard to imagine why you'd want an entry with no locations. What we'd really be interested in is an audit log, if we wanted to undo things (or see who did things).

@ric-evans (Member, Author) commented Mar 1, 2021

> 3. Delete the record if there are no remaining locations
>
> I am currently in favor of 2 or 3. I'd lean towards 3 right now, as it's hard to imagine why you'd want an entry with no locations. What we'd really be interested in is an audit log, if we wanted to undo things (or see who did things).

The issue I see with outright deleting the record arises when the file was deleted by mistake, or lost. There should be a way to get the metadata back without having to re-index the file, or alternatively to use the FC record to verify things. Adding a "verified timestamp" (along with an "active" flag) to each location could serve as a sparse audit log. Then we can have a service that deletes records that are "active": false and have a "verified timestamp" that is "old enough" (see the sketch below).
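
A sketch of such a cleanup pass, assuming the proposed per-location fields and a pymongo files handle (the cutoff and names are illustrative):

```python
import time

OLD_ENOUGH = 90 * 24 * 3600  # illustrative cutoff: 90 days, in seconds

def purge_stale(files):
    """Drop long-inactive locations, then any record with no locations left."""
    threshold = time.time() - OLD_ENOUGH
    files.update_many(
        {},
        {"$pull": {"locations": {
            "active": False,
            "verified-timestamp": {"$lt": threshold},
        }}},
    )
    files.delete_many({"locations": {"$size": 0}})
```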

@ric-evans (Member, Author) commented Mar 2, 2021

> We've gone back and forth about whether to keep deleted file records. I think there are three options:
>
>   1. Keep everything forever
>   2. Delete locations, but keep the metadata
>   3. Delete the record if there are no remaining locations
>
> I am currently in favor of 2 or 3. I'd lean towards 3 right now, as it's hard to imagine why you'd want an entry with no locations. What we'd really be interested in is an audit log, if we wanted to undo things (or see who did things).

TLDR from my comment above,
4. Quarantine deleted-filepath entries w/ a timestamp (then potentially remedy at a later date)

@jnbellinger commented:

I'm seeing a situation in which files were deleted deliberately, but the record remains and is causing problems.

If these had been actual data files, it might be good to retain a little information on what was lost. I have seen files be accidentally renamed/moved(*). That's rare, but being able to say "The checksum matches file X that we thought was lost" is a possible reason to keep the record information around.
But that's not the same thing as a file record anymore, so if we save the information at all we shouldn't call them file records. And I don't want to pick them up as some of the expected contents of a directory.

So I think I'm arguing for something like (2), but with the requirement that a query for directory information must exclude the missing files unless I specifically ask for them.

I agree that an audit is important.

(*) e.g. with a lustre crash

@dsschult (Collaborator) commented Mar 2, 2021

One option for the "quarantine and delete later" is an archive collection, which will remove the entry from normal searches but allow us to still see it.

@ric-evans (Member, Author) commented:

Responding to both of you:

@jnbellinger:

> So I think I'm arguing for something like (2), but with the requirement that a query for directory information must exclude the missing files unless I specifically ask for them.

@dsschult:

> One option for the "quarantine and delete later" is an archive collection, which will remove the entry from normal searches but allow us to still see it.

I'm not sure the overhead of an independent collection will pay off. We could keep the quarantined records/locations in the files collection, add a flag to each location, then filter these out for normal searches.
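
For example, a "normal" search could be wrapped so that only records with at least one active location match, assuming the "active" flag proposed above (helper and handle names illustrative):

```python
def normal_search(files, user_query):
    """Run a user query, excluding records whose locations are all quarantined."""
    return files.find({"$and": [
        user_query,
        # at least one location must still be active
        {"locations": {"$elemMatch": {"active": True}}},
    ]})
```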

@dsschult (Collaborator) commented Mar 2, 2021

>> One option for the "quarantine and delete later" is an archive collection, which will remove the entry from normal searches but allow us to still see it.
>
> I'm not sure the overhead of an independent collection will pay off. We could keep the quarantined records/locations in the files collection, add a flag to each location, then filter these out for normal searches.

Right now, with the way things get indexed, it would pay off in speed / not having to create even more indexes 😄

@ric-evans (Member, Author) commented Mar 2, 2021

>>> One option for the "quarantine and delete later" is an archive collection, which will remove the entry from normal searches but allow us to still see it.
>>
>> I'm not sure the overhead of an independent collection will pay off. We could keep the quarantined records/locations in the files collection, add a flag to each location, then filter these out for normal searches.
>
> Right now, with the way things get indexed, it would pay off in speed / not having to create even more indexes

True, I'm assuming we fix the scaling-speed problem.

I'm thinking about files that are sent to NERSC and are later deleted off lfs. How would we consider this scenario (when not all the locations are active/inactive)?

@jnbellinger commented:

What do you mean "When not all the locations are active?" If we delete the PFRaw from lfs but have NERSC and/or the Pole disks, we still want to keep the file information active, right? We just remove one of the "pointers" associated with it.
If we accidentally delete the PFRaw, NERSC melts down, and we have a fire in Chamberlin that roasts the Pole copies, we might as well change the file states--manually.

@dsschult (Collaborator) commented Mar 2, 2021

Files that get sent to NERSC get another location entry for that. So even if we delete the file from UW, there's still a location entry for NERSC.

ric-evans reopened this Mar 2, 2021
@ric-evans (Member, Author) commented Mar 2, 2021

> What do you mean "When not all the locations are active?" If we delete the PFRaw from lfs but have NERSC and/or the Pole disks, we still want to keep the file information active, right? We just remove one of the "pointers" associated with it.
> If we accidentally delete the PFRaw, NERSC melts down, and we have a fire in Chamberlin that roasts the Pole copies, we might as well change the file states--manually.

I'm using "locations" to mean "pointers": potato, potahto. Here's what I'm proposing:

Current Record Schema:

```jsonc
{
    <other-fields>,
    "locations": [
        {
            "site": "WIPAC",
            "path": "/data/exp/IceCube/path/to/file",
        },
        {
            "site": "NERSC",
            "path": "nersc:path/to/file",
        },
    ],
    <other-fields>
}
```

Proposed Record Schema:

```jsonc
{
    <other-fields>,
    "locations": [
        {
            "site": "WIPAC",
            "path": "/data/exp/IceCube/path/to/file",
            "active": false, // AKA the file was deleted from lfs
            "verified-timestamp": <timestamp>, // some time after the path was noticed to be invalid/deleted/etc.
        },
        {
            "site": "NERSC",
            "path": "nersc:path/to/file",
            "active": true,
            "verified-timestamp": <timestamp>,
        },
    ],
    <other-fields>
}
```

@jnbellinger commented:

If I ask "What is in the directory WIPAC:/data/exp/IceCube/2025/unbiased/Gen2Raw/0231", how will the query know what files to return?

@ric-evans (Member, Author) commented Mar 2, 2021

> If I ask "What is in the directory WIPAC:/data/exp/IceCube/2025/unbiased/Gen2Raw/0231", how will the query know what files to return?

Currently that's not possible, but it will be very soon: #77

In my proposal, the query results would be filtered to include only records whose matching location has "active": true, unless otherwise indicated.

@ric-evans (Member, Author) commented Mar 2, 2021

> If I ask "What is in the directory WIPAC:/data/exp/IceCube/2025/unbiased/Gen2Raw/0231", how will the query know what files to return?

Technically speaking, the FC would have a MongoDB index for each "locations"-entry filepath. This is a result of eliminating the "logical_name" field.
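
In MongoDB terms this would be a multikey index, which a one-line pymongo sketch can illustrate (collection handle as in the earlier sketches):

```python
# MongoDB builds a multikey index: one index entry per "locations" array
# element's "path" value.
files.create_index("locations.path")
```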

@jnbellinger commented:

Each "filepath" or the whole site+filepath?

@ric-evans (Member, Author) commented Mar 2, 2021

Each "filepath"

A.K.A whatever is in "locations" -> "path"

@jnbellinger commented:

I can easily imagine another site using the same sort of /data/exp/ path name that we do. OTOH, any index at all would expedite the search, even if we needed another clause to get rid of the MALMO files.

@ric-evans (Member, Author) commented Mar 2, 2021

Fair point. The requester would need to include the "site" field in their request query, or do client-side filtering like you said.
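
A sketch of how that could look with a compound multikey index, so that site and path are matched on the same "locations" entry (pymongo; handle and values illustrative):

```python
import pymongo

# Compound multikey index over two fields of the same "locations" element.
files.create_index([("locations.site", pymongo.ASCENDING),
                    ("locations.path", pymongo.ASCENDING)])

# $elemMatch applies both conditions to a single array element, so a
# MALMO record that happens to reuse a /data/exp/... path won't match.
cursor = files.find({"locations": {"$elemMatch": {
    "site": "WIPAC",
    "path": "/data/exp/IceCube/path/to/file",
}}})
```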

@blinkdog (Collaborator) commented Mar 3, 2021

I think the File Catalog has had a bit of an identity crisis since it was created.

As David said, we've gone back and forth about what to do with deleted records. And I think this is less about what we should do, and rather more about not understanding what the File Catalog is (or is intended to be).

Here are two alternate visions for the File Catalog:

  1. Extended Filesystem Metadata (a canonical record of IceCube's data as it currently exists)

If a file has a record in the File Catalog we know that it is a file that we intend to spend resources to keep. We know the canonical identity of the file (checksum), we know the canonical place the file would appear in a Data Warehouse file system (logical_name), and we know where we have copies of the file (locations), either directly or as part of an archive.

Files that are not part of IceCube's data as it currently exists are not eligible for the File Catalog. Anything deleted or lost should be removed from the File Catalog. "You don't have to go home, but you can't stay here."; the record can live somewhere else (if desired) but the File Catalog proper is always and foremost about the present state of the (Extended) Data Warehouse.

  2. Oracle of File Metadata (the record of all we know about IceCube's data)

If a file has a record in the File Catalog we know that it was a file that we processed at one time.

Some data may be missing, including metadata about the file and even the file itself (if there are no known locations of the file). The File Catalog record reflects our best knowledge of the file, and serves as both a means to find active files and a means to remember the deleted and lost.

Of this File Catalog we can ask questions like "Has a file with any other checksum lived in the Data Warehouse at /path/to/file?"
Or "Which version(s) of $logical_name are archived and where?"
Or "Which files on retro media were lost when the container ship caught on fire three years ago?"

I think both of these visions are valid and useful for their intended purposes. However, there is a fundamental incompatibility between the two. We need to choose one. After that, I think questions about how to structure things and what to do will fall out almost naturally.

To be fair, I think the current use cases of the File Catalog fall into vision 1. LTA is concerned with what files need to go into archives, and updating those same files when the archives are known-good at the destination.

Consider this use case:

  1. We archive some data; let's call it Level Z
  2. Later, somebody realizes an obscure error had a rare effect on some data. Level Z gets the nod.
  3. We recall the Level Z, fix it, re-index it for bundling.
    3A. Locations says 'You can't have two files at /data/warehouse/path'; meaning the broken Level Z file (gone from the warehouse, but the record is still intact) and the fixed Level Z file (in the warehouse, but we can't make a record for it, because the broken one has that location, and duplicates are forbidden)
    3B. We remove the record for the broken Level Z because why would we want that. Fixed Level Z gets a new record that contains the new path.
  4. We bundle the fixed Level Z files up and send the archive to NERSC.
  5. Some time later, somebody queries the File Catalog for a specific time frame and finds a Level Z file (not the one we fixed, but maybe a directory and archive sibling) and sees the file lives in two archives at NERSC (one containing the broken sibling, one containing the fixed sibling), and recalls both.
  6. When the bundles come back, the checksum for the broken sibling fails. (File data checksum does not match the checksum from the Catalog record.)

The question on our minds: What is this file and why is the checksum not matching?

Is the File Catalog the service that should be answering that question?
Does it tell us, "That's an older version of file X that was superseded by file Y?"
Or does it tell us, "Well, it doesn't match what I have on file. Look it up in the oracle service maybe?"

@dsschult (Collaborator) commented Mar 3, 2021

@blinkdog That's an excellent point. My vision is of 1. Extended Filesystem Metadata, except that I expand the data warehouse to "anything I can get the file back from." So I consider NERSC and DESY to be perfectly acceptable locations for files I care about.

In the overwrite example, I would first delete the local location from the old record, then add a new record with that location. It would be up to some cleanup operation to delete the NERSC archive, and finally delete the old record. Of course, this comes with a problem if you wanted unique logical names, as you would have two of them for a short time. But if this example is how we want things to appear, there are software solutions to make that happen.
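
A sketch of that sequence against the current schema, assuming a pymongo files handle (the helper and parameter names are hypothetical):

```python
def replace_local_copy(files, old_record_id, new_record, warehouse_path):
    """Retire the warehouse location on the old record, then register the fix."""
    # 1. Delete the local (WIPAC) location from the broken file's record,
    #    freeing the path for the fixed file.
    files.update_one(
        {"_id": old_record_id},
        {"$pull": {"locations": {"site": "WIPAC", "path": warehouse_path}}},
    )
    # 2. Add the new record, which now holds that location without a duplicate.
    files.insert_one(new_record)
    # 3. A later cleanup deletes the NERSC archive copy, then the old record.
```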

@jnbellinger commented:

As a user who has been granted umpteen hours on a cluster in Istanbul, I want to know from where I can pull the data, and don't care particularly whether Madison has it or not. A Madison-centric system is too limited.
So: Case 2 or David's extended Case 1

You're right, this demands some kind of version control in the FileCatalog and the retrieval procedures. It seems perfectly possible for some site to have an older level2 version, or a mix (replacement is still in progress). An analysis working on a 10-year study at some site may want to stay bug-compatible for the lifetime of the analysis--and not mix in newer data file versions.

@jnbellinger commented:

WRT Analysis Reproducibility: An analysis record should refer to what version of the data files it used.

@ric-evans (Member, Author) commented:

@blinkdog That really does put this in perspective. I hadn't thought about files that are modified but remain at the same path.

I like @dsschult's Extended Case 1, let's call it Global Filesystem Metadata (FWIW I don't think @blinkdog restricted his original case 1 to Madison-only files).

WRT overwritten files, we could move the original record to a "graveyard" collection (where we don't care about duplicate paths). @dsschult proposed something similar earlier

TLDR
2 collections: (1) a collection for data files globally accessible as of today, and (2) a collection for data files no longer accessible, AKA the "graveyard".
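
A sketch of the move into the graveyard, assuming two pymongo collections named as in the TLDR above (handle and helper names illustrative):

```python
def bury(db, record_id):
    """Move a no-longer-accessible record from `files` to the graveyard."""
    record = db.files.find_one_and_delete({"_id": record_id})
    if record is not None:
        # The graveyard has no unique index on paths, so duplicates are fine.
        db.graveyard.insert_one(record)
```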

@ric-evans (Member, Author) commented:

In a radically different approach, we could only require unique checksums (the sole primary key), and keep everything forever.

@jnbellinger commented:

Empty files have the same checksum :-)

@dsschult (Collaborator) commented Mar 3, 2021

Yeah, checksums aren't unique because of that issue (and other small files that would be the "same"). While technically the contents are identical, the metadata could be different.

@ric-evans (Member, Author) commented:

too radical I suppose 😆

@ric-evans (Member, Author) commented:

Another issue with this automation is that a file's metadata changes when a gaps file (or other auxiliary file in the same directory) is added or modified. The individual file's checksum remains the same, but it would still need to be re-indexed.

ric-evans added the significant dev label Apr 20, 2021
@ric-evans (Member, Author) commented:

Further discussion, including new and relevant use cases, by @jnbellinger: https://docs.google.com/document/d/1DkzX5VDNTxmQUOofkdGdbfykZE6Dvu8VFShyCZ19_1I

@ric-evans (Member, Author) commented:

This issue is spinning off into #109, which will create an interim solution.

Updates to follow.
