I believe we're using gen3/indexd wrong #487
Another consequence of this is that gen3 can't keep up with parallel edits and starts throwing 500 errors. See the apps-prd/dataservice-api CloudWatch logs from 2019-03-15. |
@fiendish This is going to be a problem no matter what, and probably more so if we pull all the data out into our database and only load it in batch upon release. |
I think we need to remove all interaction between the dataservice GF and gen3. Make the dataservice GF POST/PATCH take only the file's gen3 uid, which must already exist, and do not return size/hash/name or any other data forwarded from gen3 on a GF GET request to the dataservice. Do not delete from gen3 when deleting a GF from the dataservice. We can keep putting the file path into external_id, as we mostly already do; that's needed for local testing anyway, because local dataservice instances don't use gen3 at all. If portal ETL needs file sizes etc., it can ask gen3 directly. A sketch of what that contract could look like is below. |
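A minimal sketch of that decoupled contract, assuming a local dataservice at a placeholder URL, a hypothetical `latest_did` field name for the gen3 uid, and an assumed response shape (none of this is taken from the actual codebase):

```python
import requests

DATASERVICE = "http://localhost:5000"  # hypothetical local dataservice instance

# The caller registers the file with indexd first (or already knows its DID)
# and hands the dataservice only the reference, never size/hash/name.
payload = {
    "external_id": "s3://bucket/path/to/file.bam",  # file path, as we mostly do already
    "latest_did": "dg.XXXX/11111111-2222-3333-4444-555555555555",  # pre-existing gen3 uid
}
resp = requests.post(f"{DATASERVICE}/genomic-files", json=payload)

# A GET on the created GF returns only what the dataservice itself stores.
# Anyone who needs size/hashes takes the DID to gen3/indexd directly.
kf_id = resp.json()["results"]["kf_id"]  # hypothetical response shape
gf = requests.get(f"{DATASERVICE}/genomic-files/{kf_id}").json()
```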
Gen3 throws internal errors if you ask it to walk and chew gum at the same time. Then the dataservice forwards the 500 status, which basically kills GF endpoint access until kids-first/kf-api-dataservice#487 gets fixed.
Compare also (genomic-files endpoint) vs (diagnoses endpoint) or (sequencing-experiments endpoint).
|
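For what it's worth, that kind of endpoint comparison can be reproduced with a quick timing loop like the sketch below; the base URL and page size are placeholders, not values from this thread:

```python
import time
import requests

BASE = "https://dataservice.example.org"  # placeholder host, not the real deployment

# Time the same paginated GET against each endpoint to compare latency.
for endpoint in ("/genomic-files", "/diagnoses", "/sequencing-experiments"):
    start = time.monotonic()
    resp = requests.get(BASE + endpoint, params={"limit": 100})
    print(f"{endpoint}: HTTP {resp.status_code} in {time.monotonic() - start:.2f}s")
```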
Were you hitting all of these from the VPN? |
yes |
I see. It wouldn't have made a difference anyway; I thought you were also testing indexd directly. |
Listing may be made somewhat better by PR #448, but the significant delay on doing anything to any GF won't be improved. I sent PATCH to the GF endpoint over 30,000 times today, and I had to do it sequentially because of #487 (comment). It took hours to complete instead of minutes.

And PR #448 doesn't address the primary problem #487 (comment): externally-facing file download links will get broken all the time when they're never supposed to break. That's a big deal.

I can tell my code to keep retrying when the server errors (even though I think it shouldn't error), and to only ever send one request at a time to reduce the incidence of said errors, and then start it running and walk away for a few hours, or a few days, or a few weeks (something like the sketch below). It's annoying, and I shouldn't have to do that, because I'm not actually trying to interact with gen3 in any way at all when I hit the GF endpoint, but at least it doesn't break the outside world. Failing to put the "perma" in "permalink", however, has potentially terrible consequences for the project's reputation.

I also think that fixing this issue as suggested in #487 (comment) (by completely decoupling the dataservice from gen3 and no longer proxy-forwarding file size/hash metadata) makes PR #448 obsolete. |
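The workaround described above (one in-flight request at a time, plus blind retries on 5xx) amounts to something like the following sketch; the endpoint URL and payloads are placeholders:

```python
import time
import requests

def patch_with_retries(url, payload, max_attempts=10, backoff=2.0):
    """Send a PATCH and blindly retry on 5xx, because gen3 errors out
    under load. This shouldn't be necessary, but it is today."""
    for attempt in range(max_attempts):
        resp = requests.patch(url, json=payload)
        if resp.status_code < 500:
            return resp
        time.sleep(backoff * (attempt + 1))  # linear backoff between attempts
    resp.raise_for_status()  # give up after max_attempts consecutive 5xx

# Run strictly sequentially, one request in flight at a time, so that gen3
# doesn't start throwing 500s. 30,000+ PATCHes this way: hours, not minutes.
# for kf_id, payload in updates:
#     patch_with_retries(f"https://dataservice.example.org/genomic-files/{kf_id}", payload)
```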
In order to move forward with this, we need a plan for the following:
I presume the ETL process will update dataservice files with some indexd uuid for the file; if not, we may have to make changes to the Portal ETL to extract from indexd. |
Or just not. There's no great reason for the dataservice to reflect what indexd stores. We should:
|
Possibly adding more weight to getting this done and over with: apparently we don't handle gen3 errors humanely in the dataservice either. Note: I do not want us to fix this error handling; I want us to do #487 (comment). This is just more information about how fragile the gen3 interface is. See what passing an invalid md5 checksum (it was an S3 ETag, which is often not an md5 of the contents) looks like in our logs:
|
If we don't migrate back to the dataservice, we'd instead have to update the portal ETL code to pull genomic files from indexd. |
That is exactly what should happen (except replace "we", because we don't maintain the portal ETL). Something like the sketch below. |
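A sketch of what that direct pull could look like, assuming indexd's record endpoint as mounted in a typical gen3 deployment; the host is a placeholder, and this is not portal ETL code:

```python
import requests

INDEXD = "https://gen3.example.org/index"  # placeholder gen3 host

def fetch_file_metadata(did):
    """Read size/hashes/urls for one DID straight from indexd instead of
    relying on gen3-forwarded fields on the dataservice GF endpoint."""
    resp = requests.get(f"{INDEXD}/index/{did}")
    resp.raise_for_status()
    record = resp.json()
    return {
        "size": record["size"],
        "hashes": record["hashes"],      # e.g. {"md5": "..."}
        "urls": record["urls"],          # current storage locations of the blob
        "file_name": record.get("file_name"),
    }
```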
Indexd is a permalink index. It is supposed to keep a persistent reference to a file wherever the file happens to be. It does this by associating the hash and file size of the registered file with a new unique identifier. A representative record is sketched below.
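For illustration only (these values are made up, not a real record), an indexd entry ties the generated GUID to the content identity:

```python
# Illustrative indexd record, not real data. The "did" is the permalink;
# hashes + size are the content identity it is anchored to; urls are just
# the blob's current locations and may change without changing the GUID.
record = {
    "did": "dg.XXXX/11111111-2222-3333-4444-555555555555",
    "rev": "abc123de",
    "size": 1073741824,
    "hashes": {"md5": "d41d8cd98f00b204e9800998ecf8427e"},
    "urls": ["s3://bucket/path/to/file.bam"],
    "file_name": "file.bam",
}
```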
Expected usage of indexd comes from the description for Sheepdog and a gen3 forum response on the subject of deleting IDs:
If the same file is ever uploaded again, it is supposed to be able to be linked to the same ID as before. Indexd should reliably map GUID->BLOB_DATA forever, and if we can't (or don't want to) do that then there's no point in using indexd in the first place.
This means that entries should never be deleted from indexd for any reason, because if an entry is deleted then a subsequent reintroduction of the file would not reconnect to the previously established ID. Redaction and other forms of access control need to happen at a layer other than ID existence. (The lookup-by-content step this implies is sketched below.)
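The "same file gets the same ID" behavior implies looking records up by content before ever minting a new GUID. A minimal sketch, assuming indexd's listing endpoint with its hash and size filters (the host is a placeholder):

```python
import requests

INDEXD = "https://gen3.example.org/index"  # placeholder gen3 host

def find_existing_did(md5, size):
    """Check whether this exact content (hash + size) is already registered;
    if so, reuse its DID instead of creating a new, disconnected one."""
    resp = requests.get(f"{INDEXD}/index/", params={"hash": f"md5:{md5}", "size": size})
    resp.raise_for_status()
    records = resp.json().get("records", [])
    return records[0]["did"] if records else None
```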
So we're currently making two rather large errors:
1. We don't generate stable content-based hashes of the files. S3 ETags are unique instance identifiers on S3, but they are notably not stable content-based hashes, because they are generated differently depending on how the file is uploaded and which S3 encryption option is used. Indexing on something that isn't a stable content-based hash defeats most (all?) of the purpose of Gen3, because introducing the file on another platform, or even with different S3 settings, will not transfer the original ID unless we also forever transport the original non-standard ETags around with the files, which would be a bad design requiring us to forever preserve a significant bit of arcane knowledge about historical shortcuts. (A sketch of the ETag mechanics follows this list.)
2. We delete IDs from the index (upon removal of an attached entity from the dataservice) when we're not supposed to do that. Lack of presence in the dataservice, momentary or otherwise, should not disrupt the stable GUID->BLOB_DATA mapping. Currently, reintroducing a file after its associated GF entity has been removed from the dataservice will not get it assigned the original ID, resulting in broken references.
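To make point 1 concrete: for a single-part, non-KMS upload the S3 ETag happens to equal the content md5, but a multipart ETag is the md5 of the concatenated per-part digests plus a part count, so the same bytes get a different "hash" depending on upload chunking. A sketch of the mechanics:

```python
import hashlib

def s3_multipart_etag(data: bytes, part_size: int) -> str:
    """Reproduce S3's multipart ETag: md5 of the concatenated per-part md5
    digests, suffixed with the part count. Not a content hash; the same
    bytes yield different ETags under different part sizes."""
    digests = [hashlib.md5(data[i:i + part_size]).digest()
               for i in range(0, len(data), part_size)]
    return hashlib.md5(b"".join(digests)).hexdigest() + f"-{len(digests)}"

data = b"x" * (10 * 1024 * 1024)
print(hashlib.md5(data).hexdigest())             # stable content md5
print(s3_multipart_etag(data, 5 * 1024 * 1024))  # some "<md5-of-digests>-2"
print(s3_multipart_etag(data, 2 * 1024 * 1024))  # a different value, "-5"
```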
Fixing (2) will also make it sane to delete and reload studies. (The indexd call that would go away is sketched below.)
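For concreteness, the call that should disappear from the GF delete path is indexd's record delete; a hedged sketch, with a placeholder host, where `did` and `rev` come from the stored record:

```python
import requests

INDEXD = "https://gen3.example.org/index"  # placeholder gen3 host

def delete_indexd_record(did, rev):
    """The kind of call the dataservice currently makes when a GF is deleted.
    Per (2) above, this should simply never happen: removing the record
    destroys the GUID->BLOB_DATA mapping, and re-registering the same file
    later mints a new DID, breaking every external permalink."""
    requests.delete(f"{INDEXD}/index/{did}", params={"rev": rev})
```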