
move system outputs and analysis cases to Cloud Storage #335

Merged (5 commits) on Sep 20, 2022
Conversation

@lyuyangh (Member) commented Sep 16, 2022

closes #308

  • System outputs and analysis cases are now stored in Cloud Storage, so we can support datasets with large output files.
    • System outputs for CNN/Daily Mail can now be submitted successfully, but the analysis still takes a very long time.
  • Data schema of the system_outputs_xxx collection
    previous version:
    _id: ObjectId  # a system-generated ID
    system_id: str
    analysis_level: str
    data: BinData
    
    new version:
    _id: ObjectId  # a system-generated ID
    system_id: str
    analysis_level: str
    data: str  # blob name (identifier for the object stored on Cloud Storage)
    
    • The code is backward compatible, so no DB migration is required.
  • .env needs to be updated to include the bucket name and the GCP project name.
  • Cloud Storage authentication:
    • Local: authenticate with user accounts (we will need to grant everyone access).
    • ECS environment: authenticate with service accounts. The credentials are stored in AWS Secrets Manager and passed into the containers as environment variables when the containers are initialized.
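As a rough sketch of the new write path: the per-sample data is serialized and compressed, uploaded as a single blob, and only the blob name is stored in the `data` field. Note the details here are assumptions for illustration: zlib+JSON stands in for the PR's actual `SystemDBUtils._compress_data`, and `SYSTEM_OUTPUT_CONST` mirrors `SystemDBUtils._SYSTEM_OUTPUT_CONST`.

```python
import json
import zlib

# Hypothetical constant mirroring SystemDBUtils._SYSTEM_OUTPUT_CONST in the PR.
SYSTEM_OUTPUT_CONST = "system_output"


def make_blob_name(system_id: str) -> str:
    """Build the Cloud Storage object name that goes into the `data` field."""
    return f"{system_id}/{SYSTEM_OUTPUT_CONST}"


def compress_samples(samples: list) -> bytes:
    """Serialize and compress per-sample analysis data before upload."""
    return zlib.compress(json.dumps(samples).encode("utf-8"))


def decompress_samples(payload: bytes) -> list:
    """Inverse of compress_samples, used on the read path."""
    return json.loads(zlib.decompress(payload).decode("utf-8"))
```

The upload itself would then go through the google-cloud-storage client, e.g. `bucket.blob(make_blob_name(system_id)).upload_from_string(compress_samples(samples))`, with only the returned blob name written to MongoDB.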

@lyuyangh lyuyangh requested review from neubig and odashi September 16, 2022 16:14
@lyuyangh lyuyangh marked this pull request as draft September 16, 2022 16:46
@lyuyangh lyuyangh force-pushed the google-cloud branch 2 times, most recently from 4520585 to 7138a0d, on September 20, 2022 00:54
@lyuyangh lyuyangh marked this pull request as ready for review September 20, 2022 01:02
@neubig (Contributor) left a comment:

Looks nice!

Following this PR, maybe we could write a script that pulls all the files out of the MongoDB and writes them to cloud storage?

I approve personally, but I think @odashi should definitely take a look at this before it's merged.

insert_list = []
sample_list = [general_to_dict(v) for v in system_output_data.samples]
sample_compressed = SystemDBUtils._compress_data(sample_list)

blob_name = f"{system_id}/{SystemDBUtils._SYSTEM_OUTPUT_CONST}"
Contributor:

I'm not sure how the GCP file system works, but would it be OK to, for example, have 1M files in one directory? If so then this is fine, but if not then we might want to have a hierarchical structure such as f"{system_id[0:3]}/{system_id}/{SystemDBUtils._SYSTEM_OUTPUT_CONST}"
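The hierarchical layout suggested here could be sketched as follows (a hypothetical helper for illustration, not code from the PR):

```python
def hierarchical_blob_name(system_id: str, const: str = "system_output") -> str:
    """Shard blobs by the first three characters of the system ID, so that no
    single 'directory' prefix accumulates millions of entries."""
    return f"{system_id[0:3]}/{system_id}/{const}"
```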

@lyuyangh (Member, Author) commented Sep 20, 2022:

I couldn't find a place in their docs where this is stated explicitly, but I think Cloud Storage (like S3) can hold basically any number of objects/files (the S3 documentation says "unlimited scalability"). The only limit is that each object cannot exceed 5 TB. So I think this implementation is scalable.

Re: creating folders: Cloud Storage uses a flat namespace. If an object name contains "/" characters, they are treated as part of the name. It does support folders, but that is basically an illusion for the user, making it easier to navigate files with gsutil or in the web UI.
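This "folder illusion" can be demonstrated with a small pure-Python simulation of delimiter-based listing. This is an illustration only; it conceptually mirrors what a `list_blobs`-style call with `prefix` and `delimiter` arguments does over a flat namespace, without touching GCS:

```python
def list_with_delimiter(blob_names, prefix="", delimiter="/"):
    """Simulate folder-style listing over a flat namespace: any name that
    contains the delimiter after the prefix collapses into a 'subfolder'
    prefix, while the rest are returned as plain blobs."""
    blobs, prefixes = [], set()
    for name in blob_names:
        if not name.startswith(prefix):
            continue
        rest = name[len(prefix):]
        if delimiter in rest:
            # "sys1/system_output" under prefix "" collapses to "sys1/".
            prefixes.add(prefix + rest.split(delimiter, 1)[0] + delimiter)
        else:
            blobs.append(name)
    return blobs, sorted(prefixes)
```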

Contributor:

Cloud Storage uses a flat namespace.

Correct. Cloud Storage (and some other storage services) adopts a flat ("no folders") structure, and all allowed characters (including /) are treated as part of the file name.

Cloud Storage can store a basically unlimited number of files, as far as the data center allows. The only concern is the increasing cost of managing the data.

)
data = next(cursor)["data"]
# NOTE: (backward compatibility) data was stored in MongoDB previously,
# this can be removed after we regenerate DB next time.
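The backward-compatible read path described in that NOTE could look roughly like this (a sketch; `fetch_blob` is a hypothetical stand-in for the Cloud Storage download, injected so the client can be mocked):

```python
def load_system_output(data, fetch_blob):
    """Return the compressed payload for a system output record.

    Old records stored the compressed bytes directly in MongoDB (BinData,
    which pymongo surfaces as a bytes subclass); new records store only a
    blob name (str) pointing at Cloud Storage.
    """
    if isinstance(data, bytes):
        # Legacy record: payload was stored inline in MongoDB.
        return data
    # New record: `data` is the blob name; download from Cloud Storage.
    return fetch_blob(data)
```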
Contributor:

Explaining what "this" refers to would make the comment clearer.

Member Author:

I have updated the comment.

@lyuyangh (Member, Author) commented:

Thanks, @neubig! I implemented the code in a backward-compatible way because I think it'll be easier to handle this in code than to migrate the DB. If we decide to migrate the DB with a script, I'll remove the code that was added for compatibility.

@odashi (Contributor) commented Sep 20, 2022:

@lyuyangh Could you prepare a plan for the DB migration in a document (or issue)? It would be useful for understanding how and when backward-incompatible changes will happen in this repository.

@@ -6,7 +6,7 @@ This repository includes code for frontend and backend of the ExplainaBoard web

[Schema Design](https://docs.google.com/document/d/1my-zuIYosrXuoqOk1SvqZDsC2tdMgs_A7HTAtPYXKLI/edit?usp=sharing)

## Quick Start
## Quick Start for Developers
Contributor:

Does "Developers" here mean the developers of this software?
If it means any developers, that may involve security risks (especially granting access to GCS), and we couldn't authorize them.

Member Author:

Yes, developers of explainaboard_web. I don't think outside developers can actually contribute to this codebase at this point. As you mentioned, it's not secure to give them access to GCP and AWS, so they would need to build their own infrastructure (Cloud Storage, Cognito, MongoDB). If we want outside contributors, I think we'll need a way for them to easily mock these services or set up local instances to replace them.

Contributor:

Agreed, thanks!

backend/src/impl/storage.py (outdated, resolved)
backend/src/impl/storage.py (outdated, resolved)
backend/src/impl/__init__.py (outdated, resolved)
@lyuyangh (Member, Author) commented Sep 20, 2022:

@odashi Thanks for the comments! I have fixed all of them.

We don't need a DB migration for this PR (it is backward compatible). But I agree that it would be helpful to keep track of all backward-incompatible changes. We have #336, but it's specifically for breaking changes related to SDK v0.11.2. It would probably be better to have a document that we can keep adding to. Shall I create a Google Doc for this?

@odashi (Contributor) commented Sep 20, 2022:

@lyuyangh I think creating an issue on this repository focusing on DB migration is enough at this point.

@odashi (Contributor) left a comment:

Thanks!

@lyuyangh lyuyangh merged commit 01b0fb4 into main Sep 20, 2022
@lyuyangh lyuyangh deleted the google-cloud branch September 20, 2022 16:28

Successfully merging this pull request may close these issues.

Failing test case: summarization on cnn_dailymail
3 participants