-
Notifications
You must be signed in to change notification settings - Fork 2
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
move system outputs and analysis cases to Cloud Storage #335
Conversation
fce86fd
to
4dc324e
Compare
4520585
to
7138a0d
Compare
7138a0d
to
2a86d65
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks nice!
Following this PR, maybe we could write a script that pulls all the files out of the MongoDB and writes them to cloud storage?
I approve personally, think @odashi definitely needs to take a look at this before it's merged though.
insert_list = [] | ||
sample_list = [general_to_dict(v) for v in system_output_data.samples] | ||
sample_compressed = SystemDBUtils._compress_data(sample_list) | ||
|
||
blob_name = f"{system_id}/{SystemDBUtils._SYSTEM_OUTPUT_CONST}" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure how the GCP file system works, but would it be OK to, for example, have 1M files in one directory? If so then this is fine, but if not then we might want to have a hierarchical structure such as f"{system_id[0:3]}/{system_id}/{SystemDBUtils._SYSTEM_OUTPUT_CONST}"
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I couldn't find a place in their doc where it explicitly says this but I think Cloud Storage (like S3) can basically hold any number of objects/files (S3 documentation says unlimited scalability). The only limit is that each object cannot exceed 5TB. So, I think this implementation is scalable.
re creating folders: Cloud Storage uses a flat namespace. If the object name contains "/"s, they are treated as part of the name. It does support folders but it basically creates an illusion for the user so it's easier to navigate the files with gsutil
or on the web UI.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Cloud Storage uses a flat namespace.
Correct. Cloud Storage (and some other storage services) adopts flat ("no folders") structure and all allowed characters (including /
) are treated as a part of file name.
Cloud storage can store basically unlimited number of files as long as the data center allows. The only concern is the increase of cost of managing the data.
) | ||
data = next(cursor)["data"] | ||
# NOTE: (backward compatibility) data was stored in MongoDB previously, | ||
# this can be removed after we regenerate DB next time. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Explaining what "this" is would make this more clear.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have updated the comment.
Thanks, @neubig! I implemented the code in a backward compatible way because I think it'll be easier to handle it in the code than migrate the DB. If we decide to migrate the DB with a script, I'll remove the code that's added for compatibility reasons. |
@lyuyangh Could you prepare the plan of DB migration in some documents (or issues)? It is useful to understand how/when backward-incompatible changes will happen on this repository. |
@@ -6,7 +6,7 @@ This repository includes code for frontend and backend of the ExplainaBoard web | |||
|
|||
[Schema Design](https://docs.google.com/document/d/1my-zuIYosrXuoqOk1SvqZDsC2tdMgs_A7HTAtPYXKLI/edit?usp=sharing) | |||
|
|||
## Quick Start | |||
## Quick Start for Developers |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does the "developer" here mean the developers of this software?
If it means any developers, it may involve some security risks (especially granting access to GCS) and I couldn't allow them to give authorization.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, developers of explainaboard_web. I don't think outside developers can actually contribute to this codebase at this point. As you mentioned, it's not secure to give them access to GCP and AWS so they'll need to build their own infrastructure (Cloud Storage, Cognito, MongoDB). If we want outside contributors, I guess we need a way for them to easily mock or set up local instances to replace these services.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agreed, thanks!
Co-authored-by: Yusuke Oda <[email protected]>
2b33fd9
to
b344125
Compare
@odashi Thanks for the comments! I have fixed all of them. We don't need to do a DB migration for this PR (it is backward compatible). But I agree, it'll be helpful to keep track of all the backward incompatible changes. We have #336 but it's specifically for breaking changes related to SDK v0.11.2. It will probably be better if we have a document that we can keep adding things to. Shall I create a Google doc for this? |
@lyuyangh I think creating an issue on this repository focusing on DB migration is enough at this point. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks!
closes #308
previous version: