Add Support for Transient Files #79

Open · ric-evans opened this issue Feb 26, 2021 · 30 comments
Labels: data fix (modify existing persisted data) · enhancement · significant dev (this'll take some time)

Comments

@ric-evans (Member) commented Feb 26, 2021

Off the bat, this might mean eliminating the "logical_name" field.

Current Relevant Scenarios:

  1. When an actual file is moved, the corresponding FC record's "locations" object is manually updated. However, the "logical_name" retains the original path.
  2. When an actual file is deleted, the FC record is not deleted. Should it be?

Proposal:

  1. Remove the "logical_name" field. This is at best redundant, and at worst a red herring.
  2. Add an "active" field/flag to each "locations" object-entry: "active": true indicates the filepath is still valid.
  3. Add a service to regularly check up on FC records (see the sketch below)
    • This could either be an active service running on its own server;
    • or a passive service tied into the FC REST server that updates FC records only when a filepath is requested (requires access to lustre)
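
A minimal sketch of the passive variant, assuming a pymongo handle to the files collection and the per-location "active"/"verified-timestamp" fields proposed above (the collection name and helper are illustrative, not the FC's actual API):

```python
import os
import time

from pymongo import MongoClient

files = MongoClient()["file_catalog"]["files"]  # hypothetical collection handle

def verify_location(record_id, path):
    """Re-check one filepath and record the result on its "locations" entry."""
    files.update_one(
        {"_id": record_id, "locations.path": path},
        {"$set": {
            "locations.$.active": os.path.exists(path),  # requires lustre access
            "locations.$.verified-timestamp": time.time(),
        }},
    )
```
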
ric-evans added the enhancement and data fix labels Feb 26, 2021
@dsschult (Collaborator) commented:

We've gone back and forth about whether to keep deleted file records. I think there are three options:

  1. Keep everything forever
  2. Delete locations, but keep the metadata
  3. Delete the record if there are no remaining locations

I am currently in favor of 2 or 3. I'd lean towards 3 right now, as it's hard to imagine why you'd want an entry with no locations. What we'd really be interested in is an audit log, if we wanted to undo things (or see who did things).

@ric-evans (Member, Author) commented Mar 1, 2021

> 3. Delete the record if there are no remaining locations
>
> I am currently in favor of 2 or 3. I'd lean towards 3 right now, as it's hard to imagine why you'd want an entry with no locations. What we'd really be interested in is an audit log, if we wanted to undo things (or see who did things).

The issue I see with outright deleting the record arises when the file was deleted by mistake, or lost. There should be a way to get the metadata back without having to re-index the file, or alternatively to use the FC record to verify things. Adding a "verified timestamp" (along with an "active" flag) to each location could serve as a sparse audit log. Then we can have a service that deletes records that are "active": false and have a "verified timestamp" that is "old enough" (see the sketch below).
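
A sketch of such a cleanup pass, assuming the proposed per-location fields and a pymongo files handle (the cutoff and names are illustrative):

```python
import time

OLD_ENOUGH = 90 * 24 * 3600  # illustrative cutoff: 90 days, in seconds

def purge_stale(files):
    """Drop long-inactive locations, then any record with no locations left."""
    threshold = time.time() - OLD_ENOUGH
    files.update_many(
        {},
        {"$pull": {"locations": {
            "active": False,
            "verified-timestamp": {"$lt": threshold},
        }}},
    )
    files.delete_many({"locations": {"$size": 0}})
```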

@ric-evans (Member, Author) commented Mar 2, 2021

> We've gone back and forth about whether to keep deleted file records. I think there are three options:
>
>   1. Keep everything forever
>   2. Delete locations, but keep the metadata
>   3. Delete the record if there are no remaining locations
>
> I am currently in favor of 2 or 3. I'd lean towards 3 right now, as it's hard to imagine why you'd want an entry with no locations. What we'd really be interested in is an audit log, if we wanted to undo things (or see who did things).

TLDR from my comment above,
4. Quarantine deleted-filepath entries w/ a timestamp (then potentially remedy at a later date)

@jnbellinger commented:

I'm seeing a situation in which files were deleted deliberately, but the record remains and is causing problems.

If these had been actual data files, it might be good to retain a little information on what was lost. I have seen files be accidentally renamed/moved(*). That's rare, but being able to say "The checksum matches file X that we thought was lost" is a possible reason to keep the record information around.
But that's not the same thing as a file record anymore, so if we save the information at all we shouldn't call them file records. And I don't want to pick them up as some of the expected contents of a directory.

So I think I'm arguing for something like (2), but with the requirement that a query for directory information must exclude the missing files unless I specifically ask for them.

I agree that an audit is important.

(*) e.g. with a lustre crash

@dsschult (Collaborator) commented Mar 2, 2021

One option for the "quarantine and delete later" is an archive collection, which will remove the entry from normal searches but allow us to still see it.

@ric-evans (Member, Author) commented:

Responding to both of you:

@jnbellinger:

> So I think I'm arguing for something like (2), but with the requirement that a query for directory information must exclude the missing files unless I specifically ask for them.

@dsschult:

> One option for the "quarantine and delete later" is an archive collection, which will remove the entry from normal searches but allow us to still see it.

I'm not sure the overhead of an independent collection will pay off. We could keep the quarantined records/locations in the files collection, add a flag to each location, then filter these out for normal searches.
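
For example, a "normal" search could be wrapped so that only records with at least one active location match, assuming the "active" flag proposed above (helper and handle names illustrative):

```python
def normal_search(files, user_query):
    """Run a user query, excluding records whose locations are all quarantined."""
    return files.find({"$and": [
        user_query,
        # at least one location must still be active
        {"locations": {"$elemMatch": {"active": True}}},
    ]})
```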

@dsschult (Collaborator) commented Mar 2, 2021

>> One option for the "quarantine and delete later" is an archive collection, which will remove the entry from normal searches but allow us to still see it.
>
> I'm not sure the overhead of an independent collection will pay off. We could keep the quarantined records/locations in the files collection, add a flag to each location, then filter these out for normal searches.

Right now, with the way things get indexed, it would pay off in speed / not having to create even more indexes 😄

@ric-evans (Member, Author) commented Mar 2, 2021

>>> One option for the "quarantine and delete later" is an archive collection, which will remove the entry from normal searches but allow us to still see it.
>>
>> I'm not sure the overhead of an independent collection will pay off. We could keep the quarantined records/locations in the files collection, add a flag to each location, then filter these out for normal searches.
>
> Right now, with the way things get indexed, it would pay off in speed / not having to create even more indexes

True, I'm assuming we fix the scaling-speed problem.

I'm thinking about files that are sent to NERSC and are later deleted off lfs. How would we consider this scenario (when not all the locations are active/inactive)?

@jnbellinger commented:

What do you mean "When not all the locations are active?" If we delete the PFRaw from lfs but have NERSC and/or the Pole disks, we still want to keep the file information active, right? We just remove one of the "pointers" associated with it.
If we accidentally delete the PFRaw, NERSC melts down, and we have a fire in Chamberlin that roasts the Pole copies, we might as well change the file states--manually.

@dsschult (Collaborator) commented Mar 2, 2021

Files that get sent to NERSC get another location entry for that. So even if we delete the file from UW, there's still a location entry for NERSC.

ric-evans reopened this Mar 2, 2021
@ric-evans (Member, Author) commented Mar 2, 2021

> What do you mean "When not all the locations are active?" If we delete the PFRaw from lfs but have NERSC and/or the Pole disks, we still want to keep the file information active, right? We just remove one of the "pointers" associated with it.
> If we accidentally delete the PFRaw, NERSC melts down, and we have a fire in Chamberlin that roasts the Pole copies, we might as well change the file states--manually.

I'm using "locations" to mean "pointers": potato, potahto. Here's what I'm proposing:

Current Record Schema:

```jsonc
{
    <other-fields>,
    "locations": [
        {
            "site": "WIPAC",
            "path": "/data/exp/IceCube/path/to/file",
        },
        {
            "site": "NERSC",
            "path": "nersc:path/to/file",
        },
    ],
    <other-fields>
}
```

Proposed Record Schema:

```jsonc
{
    <other-fields>,
    "locations": [
        {
            "site": "WIPAC",
            "path": "/data/exp/IceCube/path/to/file",
            "active": false, // AKA the file was deleted from lfs
            "verified-timestamp": <timestamp>, // some time after the path was noticed to be invalid/deleted/etc.
        },
        {
            "site": "NERSC",
            "path": "nersc:path/to/file",
            "active": true,
            "verified-timestamp": <timestamp>,
        },
    ],
    <other-fields>
}
```

@jnbellinger commented:

If I ask "What is in the directory WIPAC:/data/exp/IceCube/2025/unbiased/Gen2Raw/0231", how will the query know what files to return?

@ric-evans (Member, Author) commented Mar 2, 2021

> If I ask "What is in the directory WIPAC:/data/exp/IceCube/2025/unbiased/Gen2Raw/0231", how will the query know what files to return?

Currently that's not possible, but it will be very soon: #77

In my proposal, the query results would be filtered to include only records whose matching location has "active": true, unless otherwise indicated.

@ric-evans (Member, Author) commented Mar 2, 2021

> If I ask "What is in the directory WIPAC:/data/exp/IceCube/2025/unbiased/Gen2Raw/0231", how will the query know what files to return?

Technically speaking, the FC would have a MongoDB index for each "locations"-entry filepath. This is a result of eliminating the "logical_name" field.
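
In MongoDB terms this would be a multikey index, which a one-line pymongo sketch can illustrate (collection handle as in the earlier sketches):

```python
# MongoDB builds a multikey index: one index entry per "locations" array
# element's "path" value.
files.create_index("locations.path")
```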

@jnbellinger commented:

Each "filepath" or the whole site+filepath?

@ric-evans (Member, Author) commented Mar 2, 2021

Each "filepath"

A.K.A whatever is in "locations" -> "path"

@jnbellinger commented:

I can easily imagine another site using the same sort of /data/exp/ path name that we do. OTOH, any index at all would expedite the search, even if we needed another clause to get rid of the MALMO files.

@ric-evans (Member, Author) commented Mar 2, 2021

Fair point. The requester would need to include the "site" field in their request query, or do client-side filtering like you said.
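
A sketch of how that could look with a compound multikey index, so that site and path are matched on the same "locations" entry (pymongo; handle and values illustrative):

```python
import pymongo

# Compound multikey index over two fields of the same "locations" element.
files.create_index([("locations.site", pymongo.ASCENDING),
                    ("locations.path", pymongo.ASCENDING)])

# $elemMatch applies both conditions to a single array element, so a
# MALMO record that happens to reuse a /data/exp/... path won't match.
cursor = files.find({"locations": {"$elemMatch": {
    "site": "WIPAC",
    "path": "/data/exp/IceCube/path/to/file",
}}})
```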

@blinkdog (Collaborator) commented Mar 3, 2021

I think the File Catalog has had a bit of an identity crisis since it was created.

As David said, we've gone back and forth about what to do with deleted records. And I think this is less about what we should do, and rather more about not understanding what the File Catalog is (or is intended to be).

Here are two alternate visions for the File Catalog:

  1. Extended Filesystem Metadata (a canonical record of IceCube's data as it currently exists)

If a file has a record in the File Catalog we know that it is a file that we intend to spend resources to keep. We know the canonical identity of the file (checksum), we know the canonical place the file would appear in a Data Warehouse file system (logical_name), and we know where we have copies of the file (locations), either directly or as part of an archive.

Files that are not part of IceCube's data as it currently exists are not eligible for the File Catalog. Anything deleted or lost should be removed from the File Catalog. "You don't have to go home, but you can't stay here."; the record can live somewhere else (if desired) but the File Catalog proper is always and foremost about the present state of the (Extended) Data Warehouse.

  2. Oracle of File Metadata (the record of all we know about IceCube's data)

If a file has a record in the File Catalog we know that it was a file that we processed at one time.

Some data may be missing, including metadata about the file and even the file itself (if there are no known locations of the file). The File Catalog record reflects our best knowledge of the file, and serves as both a means to find active files and a means to remember the deleted and lost.

Of this File Catalog we can ask questions like "Has a file with any other checksum lived in the Data Warehouse at /path/to/file?"
Or "Which version(s) of $logical_name are archived and where?"
Or "Which files on retro media were lost when the container ship caught on fire three years ago?"

I think both of these visions are valid and useful for their intended purposes. However, there is a fundamental incompatibility between the two. We need to choose one. After that, I think questions about how to structure things and what to do will fall out almost naturally.

To be fair, I think the current use cases of the File Catalog fall into vision 1. LTA is concerned with what files need to go into archives, and updating those same files when the archives are known-good at the destination.

Consider this use case:

  1. We archive some data; let's call it Level Z
  2. Later, somebody realizes an obscure error had a rare effect on some data. Level Z gets the nod.
  3. We recall the Level Z, fix it, re-index it for bundling.
    3A. Locations says 'You can't have two files at /data/warehouse/path'; meaning the broken Level Z file (gone from the warehouse, but the record is still intact) and the fixed Level Z file (in the warehouse, but we can't make a record for it, because the broken one has that location, and duplicates are forbidden)
    3B. We remove the record for the broken Level Z because why would we want that. Fixed Level Z gets a new record that contains the new path.
  4. We bundle the fixed Level Z files up and send the archive to NERSC.
  5. Some time later, somebody queries the File Catalog for a specific time frame and finds a Level Z file (not the one we fixed, but maybe a directory and archive sibling) and sees the file lives in two archives at NERSC (one containing the broken sibling, one containing the fixed sibling), and recalls both.
  6. When the bundles come back, the checksum for the broken sibling fails. (File data checksum does not match the checksum from the Catalog record.)

The question on our minds: What is this file and why is the checksum not matching?

Is the File Catalog the service that should be answering that question?
Does it tell us, "That's an older version of file X that was superseded by file Y?"
Or does it tell us, "Well, it doesn't match what I have on file. Look it up in the oracle service maybe?"

@dsschult (Collaborator) commented Mar 3, 2021

@blinkdog That's an excellent point. My vision is of 1. Extended Filesystem Metadata, except that I expand the data warehouse to "anything I can get the file back from." So I consider NERSC and DESY to be perfectly acceptable locations for files I care about.

In the overwrite example, I would first delete the local location from the old record, then add a new record with that location. It would be up to some cleanup operation to delete the NERSC archive, and finally delete the old record. Of course, this comes with a problem if you wanted unique logical names, as you would have two of them for a short time. But if this example is how we want things to appear, there are software solutions to make that happen.
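
A sketch of that sequence against the current schema, assuming a pymongo files handle (the helper and parameter names are hypothetical):

```python
def replace_local_copy(files, old_record_id, new_record, warehouse_path):
    """Retire the warehouse location on the old record, then register the fix."""
    # 1. Delete the local (WIPAC) location from the broken file's record,
    #    freeing the path for the fixed file.
    files.update_one(
        {"_id": old_record_id},
        {"$pull": {"locations": {"site": "WIPAC", "path": warehouse_path}}},
    )
    # 2. Add the new record, which now holds that location without a duplicate.
    files.insert_one(new_record)
    # 3. A later cleanup deletes the NERSC archive copy, then the old record.
```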

@jnbellinger commented:

As a user who has been granted umpteen hours on a cluster in Istanbul, I want to know from where I can pull the data, and don't care particularly whether Madison has it or not. A Madison-centric system is too limited.
So: Case 2 or David's extended Case 1

You're right, this demands some kind of version control in the FileCatalog and the retrieval procedures. It seems perfectly possible for some site to have an older level2 version, or a mix (replacement is still in progress). An analysis working on a 10-year study at some site may want to stay bug-compatible for the lifetime of the analysis--and not mix in newer data file versions.

@jnbellinger commented:

WRT Analysis Reproducibility: An analysis record should refer to what version of the data files it used.

@ric-evans (Member, Author) commented:

@blinkdog That really does put this in perspective. I hadn't thought about files that are modified but remain at the same path.

I like @dsschult's Extended Case 1, let's call it Global Filesystem Metadata (FWIW I don't think @blinkdog restricted his original case 1 to Madison-only files).

WRT overwritten files, we could move the original record to a "graveyard" collection (where we don't care about duplicate paths). @dsschult proposed something similar earlier

TLDR
2 collections: (1) a collection for data files globally accessible as of today, and (2) a collection for data files no longer accessible, AKA the "graveyard".
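
A sketch of the move into the graveyard, assuming two pymongo collections named as in the TLDR above (handle and helper names illustrative):

```python
def bury(db, record_id):
    """Move a no-longer-accessible record from `files` to the graveyard."""
    record = db.files.find_one_and_delete({"_id": record_id})
    if record is not None:
        # The graveyard has no unique index on paths, so duplicates are fine.
        db.graveyard.insert_one(record)
```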

@ric-evans (Member, Author) commented:

In a radically different approach, we could only require unique checksums (the sole primary key), and keep everything forever.

@jnbellinger commented:

Empty files have the same checksum :-)

@dsschult (Collaborator) commented Mar 3, 2021

Yeah, checksums aren't unique because of that issue (and other small files that would be the "same"). While technically the contents are identical, the metadata could be different.

@ric-evans (Member, Author) commented:

too radical I suppose 😆

@ric-evans (Member, Author) commented:

Another issue with this automation is that a file's metadata changes when a gaps file (or other auxiliary file in the same directory) is added or modified. The individual file's checksum remains the same, but it would still need to be re-indexed.

ric-evans added the significant dev label Apr 20, 2021
@ric-evans (Member, Author) commented:

Further discussion, including new and relevant use cases, by @jnbellinger: https://docs.google.com/document/d/1DkzX5VDNTxmQUOofkdGdbfykZE6Dvu8VFShyCZ19_1I

@ric-evans (Member, Author) commented:

This issue is spinning off into #109, which will create an interim solution.

Updates to follow.
