Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Adds initial support for external attachments using the Filesystem or MinIO #1320

Open
wants to merge 53 commits into
base: main
Choose a base branch
from

Conversation

Spoffy
Copy link
Contributor

@Spoffy Spoffy commented Nov 24, 2024

This PR is still being worked on, but open to initial review.

Context

When attachments are uploaded to Grist, they're stored in an SQLite table within the document.

The result is that large files (multi-GB) substantially increase the size of the SQLite file, and cause performance problems (for example, with snapshots).

Proposed solution

The current design / full scope of changes can be seen in this Google document..

This PR contains a minimal proof of concept for external attachments, with the following features:

  • External attachment storage using either MinIO or the local filesystem.
  • A "default attachment store" setting for documents, which sets the destination for newly uploaded attachments.
  • Defaulting to existing behaviour. Without additional configuration, Grist's attachment storage behaviour is unchanged.
  • MinIO attachment storage is automatically made available if it's configured for document storage, but not automatically used by documents.

No UI exists for configuring anything related to attachment storage at this time. To use a store, the API or an SQLite client needs using to set the idOfDefaultAttachmentStore document setting to the ID of the desired attachment store. Attachment store IDs are logged to the console.

Currently only a single store of each backend type (minio, filesystem) can exist, as they're automatically configured using the existing setting.

Related issues

#1021

Review guidance / code overview

Three significant components are added in this PR, with the rest of the PR being generally glue around them:

  • AttachmentFileManager (in "AttachmentFileManager.ts") handles attachment fetching / saving logic on behalf of ActiveDoc
  • An IAttachmentStore is an instantiated store, that handles saving / loading attachments from a specific provider.
  • AttachmentStoreProvider is used to resolve store IDs to stores, and is the main way of providing other components (e.g AttachmentFileManager) with access to store instances.

The available types of attachment store are defined in coreCreator.ts, under the name "attachmentStoreBackends".

Additionally:

  • A few existing classes have been extended with streaming methods, to simplify interaction with the existing attachment code and to facilitate large files in the future without keeping copies in memory.
  • Tests may not pass yet, and additional unit tests are WIP.

Migrations / schema changes

  • One migration to _gristsys_files has been added, which adds a column to track which store a document is in.
  • idOfDefaultAttachmentStore has been added to the DocumentSettings JSON blob.

Both of these should be non-breaking on older installations, as both should be ignored.

Has this been tested?

Behaviour has been manually tested in the following ways:

  • Filesystem storage
  • MinIO client with MinIO storage
  • MinIO client with S3 storage

There are currently no unit tests. These should be added before merging, but the goal.

Recommended local testing setup

MinIO

  1. Set up Grist to use MinIO normally.
  2. Start Grist
  3. Retrieve the attachment store id from the console.
  4. Download the Grist document.
  5. Open the Grist document in a database client, and add "idOfDefaultAttachmentStore": "store id" to _grist_DocInfo -> documentSettings
  6. Upload the Grist document
  7. Try adding an attachment to the new document. The name should show up in MinIO's storage path on the local filesystem.

Known issues

  • When using S3 storage, attachments showed as broken until refreshing the page.

Copy link
Member

@paulfitz paulfitz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just a batch of superficial comments from a skim. Feels like it is coming along pretty well!

@@ -2,6 +2,7 @@ export interface DocumentSettings {
locale: string;
currency?: string;
engine?: EngineCode;
idOfDefaultAttachmentStore?: string;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Small nit: there's a lot of places in Grist where something could be called idOfFoo and is instead called fooId. I think that would make sense here too to rhyme with the rest of the codebase, defaultAttachmentStoreId.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(attachmentStoreId would be fine too, locale and currency are also defaults but we don't spell that out)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My motivation for phrasing it this way specifically was the ambiguity of the variable if it ends in id.

It could be:
defaultAttachmentStore Id
or
default attachmentStoreId

Which are semantically different interpretations of the same name! I'd be happy to swap it around if you think defaultAttachmentStoreId is clear enough though, despite the above :)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

attachmentStoreId is my favorite now I'm afraid

Copy link
Contributor Author

@Spoffy Spoffy Nov 27, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good enough for me - it's less clear for purpose, but at least resolves the ambiguity. I wasn't a big fan of idOf anyway. 🙂

app/server/lib/ActiveDoc.ts Outdated Show resolved Hide resolved
@@ -2822,17 +2841,21 @@ export class ActiveDoc extends EventEmitter {
return this._dataEngine;
}

private async _getDocumentSettings(): Promise<DocumentSettings | undefined> {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If there are no settings, should that be an error? In what circumstances might that happen?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, I see the method gets used in a migration. That answers that. Might still make sense to throw an error, but accept that error as a non-erroneous possibility in the migration. Feels like in all other circs it should be an error? No big deal.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can switch this to throwing an error, then wrap the code below in _makeEngine in a try-catch block to handle this case?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

that'd be fine

app/server/lib/AttachmentFileManager.ts Outdated Show resolved Hide resolved
app/server/lib/AttachmentFileManager.ts Outdated Show resolved Hide resolved
* Gets the correct pool id for a given document, given the document's id and trunk id.
* Avoids other areas of the codebase having to understand how documents are mapped to pools.
* This is a key security point - documents which share a pool can access each others' attachments.
* Therefore documents with different security domains (e.g from different teams) need to be sure not to share
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would be good now to talk about our strategy, for snapshots and forks to have the same pool id and share attachments?

* This is a general-purpose interface that should abstract over many different storage providers,
* so shouldn't have methods which rely on one the features of one specific provider.
*
* This is distinct from `ExternalStorage`, in that it's more generic and doesn't support versioning.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm "more generic" feels wrong here, since an interface mentioning docPoolIds: DocPoolId vs key: string feels much more specific in some ways. Not many things in the world have DocPoolIds. Maybe drop that text?

export class MinIOAttachmentStore implements IAttachmentStore {
constructor(
public id: string,
private _storage: MinIOExternalStorage,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So the reason this is using MinIOExternalStorage specifically and won't work with S3ExternalStorage or AzureExternalStorage is new Stream methods, correct? It would still help I think to anticipate that developers will add those methods for those types of storage. It is totally fine just to implement for MinIO here, since this is core, but ideally the follow up work needed would be adding the methods and making small updates, and not doing a significant refactor. Trying to bear in mind for reminder of review to watch for specific dependencies on MinIO creeping in.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Read rest of PR. Feels fine, no MinIO creep 👍

Copy link
Contributor Author

@Spoffy Spoffy Nov 25, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

100% - I can't remember if I left the "Make an interface for streamable external storage" in a TODO or not somewhere, but it's planned, so we can turn "MinIOAttachmentStore" into "ExternalStorageAttachmentStore"

Copy link
Contributor Author

@Spoffy Spoffy Dec 4, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is done now - any external storage should be usable, if wrapped in ExternalStorageAttachmentStore, and made to implement the streaming interface

storeExists(id: AttachmentStoreId): Promise<boolean>;
}

export interface IAttachmentStoreBackendFactory {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Extreme Java code vibes just here. If you search your heart, can you find a shorter name? I do see what you're going for.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I looked deep into my heart and gave it a shot, let me know if this is better!

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks :) yes

app/server/lib/AttachmentStoreProvider.ts Show resolved Hide resolved
}

public async getFileData(fileIdent: string): Promise<Buffer | null> {
this._log.debug({ fileIdent }, "retrieving file data");
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm concerned this logging might be a bit much. When a document is opened, it loads attachment previews. That results in 1 or 2 log lines for every attachment in a document.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That might indeed be a bit much.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Removed 1 of the log lines - the other one is quite useful for debugging as it mentions the attachment pool, and which storage (document / external) the file is going into.

@Spoffy Spoffy marked this pull request as ready for review December 3, 2024 18:47
@jordigh jordigh self-assigned this Dec 3, 2024
Copy link
Member

@paulfitz paulfitz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is copying of documents with external attachments addressed? Might have missed it. If not, would want to be sure we don't miss that in follow-up.

@@ -636,6 +648,7 @@ export class ActiveDoc extends EventEmitter {
useExisting?: boolean, // If set, document can be created as an overlay on
// an existing sqlite file.
}): Promise<ActiveDoc> {
console.log(`OPENOPENOPENOPENOPENOPENOPENOPEN ${this.docName} ANYTHING ELSE`);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Stray debugging message?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yup, this and others. Thanks for catching this 😅

@@ -388,6 +392,14 @@ export class ActiveDoc extends EventEmitter {
loadTable: this._rawPyCall.bind(this, 'load_table'),
});

// This will throw errors if _doc or externalAttachmentStoreProvider aren't provided, and ActiveDoc tries to use
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what is _doc?

}

/**
* Instantiated on a per-document to basis to provide a document with access to its attachments.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

strike "to" in "to basis"?

// TODO - Decide if we keep an entry in SQLite after an upload error or not. Probably not?
await store.upload(this._getDocPoolId(), fileIdent, Readable.from(fileData));

// TODO - Confirm in doc storage that it's successfully uploaded? Need to decide how to handle a failed upload.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since external storage is not entirely under our control (an admin might reasonably delete some files there) we need to degrade gracefully if some attachments are mysteriously missing from expected location. Presumably that means that even without this TODO, a failure is non-catastrophic in the sense that the document will continue to work.

Copy link
Contributor Author

@Spoffy Spoffy Dec 30, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's not well explained in this PR, but the transfer PR cleans up these errors, and explicitly re-orders operations to ensure we're always in a valid state (if the upload fails, the whole "Upload an attachment" flow fails, and no entries go into the DB).

In the case that the file isn't accessible in external storage, it should show that the attachment isn't available in the UI. I think that's handled on the clientside when the API returns an error response?

type FileId = string;


// Minimum required info from a document
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"Minimum required info from a document" for what purpose?

* Generally, the pool id of a document should be its trunk id if available (because it's a fork),
* or the document's id (if it isn't a fork).
*
* This means that attachments need copying to a new pool when a document is copied.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think we've talked explicitly about the consequences of document copying now being a potentially slow operation. As long as a copy fails cleanly if there's a problem, I think that would be the main thing.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think we have either - might be worth a chat?

I think the copy behaviour is missing from this PR to be honest (:grimacing:). It might be worth opening as a separate PR at this point?

_installationUuid: string
) {
// In the current setup, we automatically generate store IDs based on the installation ID.
// The installation ID is guaranteed to be unique, and we only allow one store of each backend type.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we only allow one store of each backend type - I thought we agreed to move on from this as too restrictive? It makes no sense to me to allow say an s3 bucket and an azure bucket but then to kick up a fuss if the azure bucket is replaced with a distinct s3 bucket.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I remember discussing permitting a small label that would take precedence over storeSpec.name if given.

Copy link
Contributor Author

@Spoffy Spoffy Dec 30, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We've since discussed this, and will use a "JSON in environment variable" approach (adding comment here for posterity)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants