Add Support for Transient Files #79
We've gone back and forth about whether to keep deleted file records. I think there are three options:
I am currently in favor of 2 or 3. I'd lean towards 3 right now, as it's hard to imagine why you'd want an entry with no locations. What we'd really be interested in is an audit log, if we wanted to undo things (or see who did things). |
The issue I see with outright deleting the record is if the file was deleted by mistake, or lost. There should be a way to get the metadata back without having to re-index the file, or alternatively be able to use the FC record to verify things. Adding a |
TLDR from my comment above, |
I'm seeing a situation in which files were deleted deliberately, but the record remains and is causing problems. If these had been actual data files, it might be good to retain a little information on what was lost. I have seen files be accidentally renamed/moved(*). That's rare, but being able to say "The checksum matches file X that we thought was lost" is a possible reason to keep the record information around. So I think I'm arguing for something like (2), but with the requirement that a query for directory information must exclude the missing files unless I specifically ask for them. I agree that an audit is important. (*) e.g. with a lustre crash |
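As a rough illustration of that "the checksum matches file X that we thought was lost" lookup: a recovered or mystery file could be re-hashed and matched against existing catalog records. This is only a sketch; the connection string, database/collection names, and the `checksum.sha512` field layout are assumptions for illustration, not the actual File Catalog API.

```python
# Hypothetical sketch (pymongo): match a recovered file's checksum against
# existing File Catalog records. All names here are illustrative.
import hashlib

from pymongo import MongoClient


def sha512_of(path: str) -> str:
    """Compute the SHA-512 hex digest of a file on disk, in 1 MiB chunks."""
    digest = hashlib.sha512()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


client = MongoClient("mongodb://localhost:27017")   # assumed connection
files = client["file_catalog"]["files"]             # assumed db/collection

# A file turned up under an unexpected name; does it match something
# the catalog says we lost?
candidate = sha512_of("/data/exp/mystery_file.tar.gz")
match = files.find_one({"checksum.sha512": candidate})
if match is not None:
    print("Checksum matches existing record:", match.get("logical_name"))
```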
One option for the "quarantine and delete later" is an archive collection, which will remove the entry from normal searches but allow us to still see it. |
Responding to both of you:
I'm not sure the overhead of an independent collection will pay off. We could keep the quarantined records/locations in the |
Right now, with the way things get indexed, it would pay off in speed / not having to create even more indexes 😄 |
True, I'm assuming we fix the scaling-speed problem. I'm thinking about files that are sent to NERSC and are later deleted off lfs. How would we consider this scenario (when not all the locations are active/inactive)? |
What do you mean "When not all the locations are active?" If we delete the PFRaw from lfs but have NERSC and/or the Pole disks, we still want to keep the file information active, right? We just remove one of the "pointers" associated with it. |
Files that get sent to NERSC get another location entry for that. So even if we delete the file from UW, there's still a location entry for NERSC. |
I'm using "locations" to mean "pointers", potato potato. Here's what I'm proposing: Current Record Schema:
Proposed Record Schema:
|
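Since the schema blocks aren't reproduced above, here is a purely illustrative guess at what a record with per-location `"active"` flags could look like. Every field name below is an assumption rather than the real File Catalog schema; the point is only that deleting the WIPAC copy flips a flag instead of deleting the record.

```python
# Illustrative only -- not the actual FC schema.
proposed_record = {
    "uuid": "0f8fad5b-d9cb-469f-a165-70867728950e",  # made-up identifier
    "checksum": {"sha512": "..."},                    # elided
    "file_size": 123456789,
    "locations": [
        {
            "site": "WIPAC",
            "path": "/data/exp/IceCube/2025/unbiased/Gen2Raw/0231/file.tar.gz",
            "active": False,  # deleted from the local warehouse
        },
        {
            "site": "NERSC",
            "path": "/archive/path/elided/file.tar.gz",  # hypothetical archive path
            "active": True,   # this copy still exists
        },
    ],
}
```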
If I ask "What is in the directory WIPAC:/data/exp/IceCube/2025/unbiased/Gen2Raw/0231", how will the query know what files to return? |
Currently that's not possible, but it will be very soon: #77. In my proposal, the query results would be filtered to only include records where |
Technically speaking, the FC would have a MongoDB index for each |
Each "filepath" or the whole site+filepath? |
A.K.A whatever is in |
I can easily imagine another site using the same sort of /data/exp/ path name that we do. OTOH, any index at all would expedite the search, even if we needed another clause to get rid of the MALMO files. |
Fair point. The requester would need to include the |
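A sketch of how that directory query and index might look with pymongo, assuming the per-location `"active"` flag and a compound index covering both site and path (so MALMO's `/data/exp/...` paths don't collide with WIPAC's). The collection name, field names, and anchored-regex prefix match are all assumptions for illustration.

```python
# Sketch only: directory listing filtered to active locations at one site.
from pymongo import ASCENDING, MongoClient

client = MongoClient("mongodb://localhost:27017")
files = client["file_catalog"]["files"]

# Compound index on site + path, so the site clause is part of the index
# rather than an extra filter applied afterwards.
files.create_index([("locations.site", ASCENDING), ("locations.path", ASCENDING)])

# "What is in WIPAC:/data/exp/IceCube/2025/unbiased/Gen2Raw/0231?"
directory = "/data/exp/IceCube/2025/unbiased/Gen2Raw/0231/"
cursor = files.find({
    "locations": {
        "$elemMatch": {
            "site": "WIPAC",
            "path": {"$regex": "^" + directory},  # anchored prefix match
            "active": True,
        }
    }
})
for record in cursor:
    print(record["_id"])
```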
I think the File Catalog has had a bit of an identity crisis since it was created. As David said, we've gone back and forth about what to do with deleted records. And I think this is less about what we should do and more about not understanding what the File Catalog is (or is intended to be). Here are two alternate visions for the File Catalog:
1. If a file has a record in the File Catalog we know that it is a file that we intend to spend resources to keep. We know the canonical identity of the file (checksum), we know the canonical place the file would appear in a Data Warehouse file system (logical_name), and we know where we have copies of the file (locations), either directly or as part of an archive. Files that are not part of IceCube's data as it currently exists are not eligible for the File Catalog. Anything deleted or lost should be removed from the File Catalog. "You don't have to go home, but you can't stay here": the record can live somewhere else (if desired), but the File Catalog proper is always and foremost about the present state of the (Extended) Data Warehouse.

2. If a file has a record in the File Catalog we know that it was a file that we processed at one time. Some data may be missing, including metadata about the file and even the file itself (if there are no known locations of the file). The File Catalog record reflects our best knowledge of the file, and serves as both a means to find active files and a means to remember the deleted and lost. Of this File Catalog we can ask questions like "Has a file with any other checksum lived in the Data Warehouse at /path/to/file?"

I think both of these visions are valid and useful for their intended purposes. However, there is a fundamental incompatibility between the two. We need to choose one. After that, I think questions about how to structure things and what to do will fall out almost naturally.

To be fair, I think the current use cases of the File Catalog fall into vision 1. LTA is concerned with what files need to go into archives, and with updating those same files when the archives are known-good at the destination. Consider this use case:
The question on our minds: What is this file and why is the checksum not matching? Is the File Catalog the service that should be answering that question? |
@blinkdog That's an excellent point. My vision is 1 (Extended Filesystem Metadata), except that I expand the data warehouse to "anything I can get the file back from." So I consider NERSC and DESY to be perfectly acceptable locations for files I care about. In the overwrite example, I would first delete the local location from the old record, then add a new record with that location. It would be up to some cleanup operation to delete the NERSC archive, and finally delete the old record. Of course, this comes with a problem if you wanted unique logical names, as you would have two of them for a short time. But if this example is how we want things to appear, there are software solutions to make that happen. |
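A minimal sketch of that overwrite sequence, again assuming pymongo and illustrative field names rather than the actual File Catalog interface:

```python
# Sketch: overwrite handling -- pull the local location off the old record,
# create a new record for the new file, and let cleanup retire empty records.
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")
files = client["file_catalog"]["files"]

local_path = "/data/exp/IceCube/2025/unbiased/Gen2Raw/0231/file.tar.gz"

# 1. Remove the local (WIPAC) location from the old record.
files.update_one(
    {"locations": {"$elemMatch": {"site": "WIPAC", "path": local_path}}},
    {"$pull": {"locations": {"site": "WIPAC", "path": local_path}}},
)

# 2. Add a new record for the overwriting file at the same local path.
files.insert_one({
    "checksum": {"sha512": "..."},  # checksum of the new file (elided)
    "locations": [{"site": "WIPAC", "path": local_path, "active": True}],
})

# 3. Later, a cleanup job deletes the NERSC archive copy and finally drops
#    any old record that has no locations left.
files.delete_many({"locations": {"$size": 0}})
```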
As a user who has been granted umpteen hours on a cluster in Istanbul, I want to know from where I can pull the data, and don't care particularly whether Madison has it or not. A Madison-centric system is too limited. You're right, this demands some kind of version control in the FileCatalog and the retrieval procedures. It seems perfectly possible for some site to have an older level2 version, or a mix (replacement is still in progress). An analysis working on a 10-year study at some site may want to stay bug-compatible for the lifetime of the analysis--and not mix in newer data file versions. |
WRT Analysis Reproducibility: An analysis record should refer to what version of the data files it used. |
@blinkdog That really does put this in perspective. I hadn't thought about files that are modified but remain at the same path. I like @dsschult's Extended Case 1; let's call it Global Filesystem Metadata (FWIW I don't think @blinkdog restricted his original case 1 to Madison-only files). WRT overwritten files, we could move the original record to a "graveyard" collection (where we don't care about duplicate paths). @dsschult proposed something similar earlier TLDR |
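For the "graveyard" idea, the mechanics could be as simple as copying the record into a second collection (with no uniqueness constraints) before deleting it from the main one. A sketch under those assumptions; the collection names and helper are hypothetical:

```python
# Sketch: retire a record to a "graveyard" collection so normal searches
# no longer see it, but it can still be inspected or restored.
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")
db = client["file_catalog"]
files, graveyard = db["files"], db["files_graveyard"]  # names are assumptions


def bury(record_filter: dict) -> None:
    """Copy a matching record into the graveyard, then remove the original."""
    record = files.find_one(record_filter)
    if record is None:
        return
    # The graveyard has no unique index on paths, so duplicates are fine here.
    graveyard.insert_one(record)
    files.delete_one({"_id": record["_id"]})
```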
In a radically different approach, we could only require unique checksums (the sole primary key), and keep everything forever. |
Empty files have the same checksum :-) |
Yeah, checksums aren't unique because of that issue (and other small files that would be the "same"). While technically the contents are identical, the metadata could be different. |
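To make the objection concrete: two different (empty) files hash to the same value, so the checksum by itself can't act as a unique key even when the surrounding metadata differs.

```python
# Two distinct empty files share one SHA-512 digest.
import hashlib
import tempfile


def sha512_file(path: str) -> str:
    with open(path, "rb") as f:
        return hashlib.sha512(f.read()).hexdigest()


file_a = tempfile.NamedTemporaryFile(delete=False)
file_b = tempfile.NamedTemporaryFile(delete=False)
file_a.close()
file_b.close()

assert file_a.name != file_b.name                            # different files...
assert sha512_file(file_a.name) == sha512_file(file_b.name)  # ...same checksum
```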
too radical I suppose 😆 |
Another issue with this automation is that a file's metadata changes when a gaps file (or other auxiliary file in the same directory) is added/modified. The individual file's checksum remains the same but it would still need to be re-indexed. |
Further discussion, including new and relevant use cases, by @jnbellinger: https://docs.google.com/document/d/1DkzX5VDNTxmQUOofkdGdbfykZE6Dvu8VFShyCZ19_1I |
This issue is spinning off into #109, which will create an interim solution. Updates to follow.
Off the bat, this might mean eliminating the `"logical_name"` field.

Current Relevant Scenarios:

- A file's `"locations"` object is manually updated; however, the `"logical_name"` retains the original path.

Proposal:

- Eliminate the `"logical_name"` field. This is at best redundant, and at worst a red herring.
- Add an `"active"` field/flag to each `"locations"` object-entry: `"active": True` indicates the filepath is still valid.