docs: ADR for serving static assets #110
15. Serving Course Team Authored Static Assets
==============================================

Context
-------

Both Studio and the LMS need to serve course team authored static assets as part of the authoring and learning experiences. "Static assets" in the edx-platform context presently refers to: image files, audio files, text document files like PDFs, older video transcript files, and even JavaScript and Python files. It does NOT typically include video files, which are treated separately because of their large file size and complex workflows (processing for multiple resolutions, using third-party dictation services, etc.).
This ADR is the synthesis of various ideas that were discussed across a handful of pull requests and issues. These links are provided for extra context, but they are not required to understand this ADR:

* `File uploads + Experimental Media Server #31 <https://github.com/openedx/openedx-learning/pull/31>`_
* `File Uploads + media_server app #33 <https://github.com/openedx/openedx-learning/pull/33>`_
* `Modeling Files and File Dependencies #70 <https://github.com/openedx/openedx-learning/issues/70>`_
* `Serving static assets (disorganized thoughts) #108 <https://github.com/openedx/openedx-learning/issues/108>`_
Data Storage Implementation
~~~~~~~~~~~~~~~~~~~~~~~~~~~

The underlying data models live in the openedx-learning repo. The most relevant models are:

* `RawContent in contents/models.py <https://github.com/openedx/openedx-learning/blob/main/openedx_learning/core/contents/models.py>`_
* `Component and ComponentVersion in components/models.py <https://github.com/openedx/openedx-learning/blob/main/openedx_learning/core/components/models.py>`_
Key takeaways about how this data is stored:

* Assets are associated and versioned with Components, where a Component is typically an XBlock. So you don't ask for "version 5 of /static/fig1.webp"; you ask for "the /static/fig1.webp associated with version 5 of this block".
* The initial MVP will serve assets for v2 content libraries, where all static assets are associated with a particular component XBlock. Later on, we'll want to allow courses to port their existing files and uploads into this system in a backwards compatible way. We will probably do this by creating a non-XBlock, filesystem Component type that treats the entire course's uploads as a single Component. The specifics of how that is modeled on the backend are out of scope for this ADR, but the general approach is meant to work for both use cases.
* The actual raw asset data is stored in django-storages using its hash value as the file name. This makes it cheap to make many references to the same asset data under different names and versions, but it means that we cannot simply give the browser direct links to the raw file data (see the next section for details).
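The hash-as-filename scheme above can be sketched in a few lines. This is a hypothetical helper, not the actual openedx-learning API, and the hash algorithm chosen here is purely illustrative; the point is that identical bytes always map to one stored object, however many logical names reference it:

```python
import hashlib

def raw_content_key(data: bytes) -> str:
    """Map asset data to a hash-based storage key.

    Hypothetical helper (not the real openedx-learning code); the hash
    algorithm is illustrative. Identical data always maps to the same
    key, so many (component_version, filename) pairs can point at one
    stored object.
    """
    return hashlib.blake2b(data, digest_size=20).hexdigest()

# The same image referenced under two names across two versions is
# stored exactly once:
data = b"fake image bytes"
key_v4 = raw_content_key(data)  # referenced as static/fig1.webp in v4
key_v5 = raw_content_key(data)  # referenced as static/figure1.webp in v5
assert key_v4 == key_v5
```

Deduplication is what makes versioning cheap here: publishing a new Component version that keeps an unchanged asset costs only a new reference row, not a new copy of the data.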
The Difficulty with Direct Links to Raw Data Files
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Since the raw data is stored as objects in an S3-like store, and the mapping of file names and versions to that raw data is stored in Django models, why not simply have a Django endpoint that redirects a request for a named asset to the hash-named raw data it corresponds to?

**It will break relative links between assets.**
The raw data files exist in a flat space with hashes for names, meaning that any relative links between assets (e.g. a JavaScript file referencing an image) would break once a browser follows the redirect.
**Setting Metadata: Content types, filenames, and caching.**
The assets won't generally "work" unless they're served with the correct ``Content-Type`` header. For users who want to download a file, it's quite inconvenient if the filename doesn't include the correct file extension (not to mention a friendly name instead of the hash). So we need to set the ``Content-Type`` and/or ``Content-Disposition`` (with its ``filename`` parameter) headers.

Setting these values per request has proved problematic because some (but not all) S3-compatible storage services, including S3 itself, only support per-request header overrides on signed GET requests. Signed URLs get in the way of caching and introduce the possibility of browsers caching expired links, leading to all kinds of annoying cache invalidation issues.

Setting the filename value at upload time also doesn't work, because the same data may be referenced under different filenames by different Components, or even by different versions of the same Component.
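Because the headers depend on the logical filename rather than the stored hash name, they have to be computed at serve time. A minimal standard-library sketch (the helper name is illustrative):

```python
import mimetypes

def asset_headers(filename: str) -> dict:
    """Build download headers for a named asset (illustrative helper).

    The Content-Type is derived from the *logical* filename that the
    Component references, never from the hash-named object in storage.
    """
    content_type, _encoding = mimetypes.guess_type(filename)
    return {
        "Content-Type": content_type or "application/octet-stream",
        "Content-Disposition": f'inline; filename="{filename}"',
    }

print(asset_headers("fig1.png")["Content-Type"])  # image/png
```

This is exactly the computation that cannot be pushed into the object store at upload time, since two Components may reference the same bytes under different filenames.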
Application Requirements
~~~~~~~~~~~~~~~~~~~~~~~~

**Relative links between assets must be preserved.**
Assets may reference each other with relative links, e.g. a JavaScript file that references images or other JavaScript files. That means our solution cannot require querystring-based authorization tokens in the style of S3 signed URLs, since asset files would have no way to encode those tokens into their relative links.
**Multiple versions of an asset should be available at the same time.**
Our system should be able to serve at minimum the current draft and published versions of an asset. Ideally, it should be able to serve any version. This is a departure from the way Studio and the LMS currently handle files and uploads, since there is currently no versioning at all: assets exist in a flat namespace at the course level and are immediately published.
Security Requirements
~~~~~~~~~~~~~~~~~~~~~

**Assets must enforce user+file read permissions at the Learning Context level.**
The MongoDB GridFS backed ContentStore currently supports course-level access checks that can be toggled on and off for individual assets. Uploaded assets are public by default and can be downloaded by anyone who knows the URL, regardless of whether or not they are enrolled in the course. They can optionally be "locked", which restricts downloads to students who are enrolled in the course.
**Assets should enforce more granular permissions at the individual Component level.**
An important distinction between ContentStore and v2 Content Library assets is that the latter can be directly associated with a Component. As a long term goal, we should be able to make permissions checks on a per-Component basis. So if a student does not have permission to view a Component for whatever reason (wrong content group, exam hasn't started, etc.), then they should also not have permission to see static assets associated with that Component.

The further implication of this requirement is that *permissions checking must be extensible*. The openedx-learning repo will implement the details of how to serve an asset, but it will not have the necessary models and logic to determine whether it is allowed to do so.
**Assets must be served from an entirely different domain than the LMS and Studio instances.**
To reduce the chance of maliciously uploaded JavaScript compromising LMS and Studio users, user-uploaded assets must live on an entirely different domain from LMS and Studio (i.e. not just another subdomain). So if our LMS is located at ``sandbox.openedx.org``, the files should be accessed at a URL like ``assets.sandbox.openedx.io``. A sibling subdomain is not considered sufficient, because it is only safe if no cookies are ever set on the shared root domain; serving user content from an unrelated domain is the same reason GitHub hosts Pages on ``github.io`` rather than ``github.com``.
Operational Requirements
~~~~~~~~~~~~~~~~~~~~~~~~

**The asset server must be capable of handling high levels of traffic.**
Django views are a poor choice for streaming files at scale, especially when deploying with WSGI (as Open edX does), since streaming ties down a worker process for the entire duration of the response. While a Django-based streaming response may be sufficient for small-to-medium traffic sites, we should allow for a more scalable solution that takes full advantage of modern CDN capabilities.
**Serving assets should not *require* ASGI deployment.**
Deploying the LMS and Studio using ASGI would likely substantially improve the scalability of a Django-based streaming solution, but migrating and testing this new deployment type for the entire stack is a large task and is considered out of scope for this project.
Decision
--------

URLs
~~~~

The format will be: ``https://{asset_server}/assets/apps/{app}/{learning_package_key}/{component_key}/{version}/{filepath}``
The assets will be served from a completely different domain than the LMS and Studio, not merely a different subdomain.

A more concrete example: ``https://studio.assets.sandbox.openedx.io/apps/content_libraries/lib:Axim:200/xblock.v1:problem@826eb471-0db2-4943-b343-afa65a6fdeb5/v2/static/images/fig1.png``
The ``version`` can be:

* ``draft``, indicating the latest draft version (viewed by authors in Studio).
* ``published``, indicating the latest published version (viewed by students in the LMS).
* ``v{num}``, meaning a specific version, e.g. ``v20`` for version 20.
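The ``version`` segment is simple enough to validate with a single pattern. A sketch of how a view might parse it (the regex and function name are illustrative, not the actual implementation):

```python
import re

# Matches "draft", "published", or "v{num}" per the URL format above.
# Illustrative pattern, not the real implementation.
VERSION_RE = re.compile(r"^(draft|published|v(?P<num>[1-9]\d*))$")

def parse_version(segment: str):
    """Return "draft", "published", or an int version number."""
    match = VERSION_RE.match(segment)
    if match is None:
        raise ValueError(f"bad version segment: {segment!r}")
    if match.group("num"):
        return int(match.group("num"))
    return segment

assert parse_version("draft") == "draft"
assert parse_version("v20") == 20
```

Keeping ``draft`` and ``published`` as symbolic names (rather than resolving them client-side to version numbers) is what lets the same asset URL keep working as new versions are published.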
Asset Server Implementation
~~~~~~~~~~~~~~~~~~~~~~~~~~~

There will be two asset server URLs, one corresponding to the LMS and one corresponding to Studio, each with its own subdomain. An example set of domains might be:

* LMS: ``sandbox.openedx.org``
* Studio: ``studio.sandbox.openedx.org``
* LMS Assets: ``lms.assets.sandbox.openedx.io`` (note the ``.io`` top level domain)
* Studio Assets: ``studio.assets.sandbox.openedx.io``
The asset serving domains will be serviced by a Caddy instance that is configured as a reverse proxy to the LMS or Studio. Caddy will be configured to only proxy a specific set of paths that correspond to valid asset URLs.
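A minimal Caddyfile sketch of that arrangement, using the example Studio asset domain above and a hypothetical ``studio:8000`` upstream (the matcher and upstream are illustrative assumptions, not a tested production config):

```caddyfile
studio.assets.sandbox.openedx.io {
    # Refuse anything that is not a valid asset path, so stray
    # requests never reach Studio through this domain.
    @not_assets not path /assets/apps/*
    respond @not_assets 404

    # Proxy asset requests through to the Studio process.
    reverse_proxy studio:8000
}
```

Restricting the proxied paths is part of the security story: the asset domain should expose only the asset-serving views, not the rest of Studio.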
Django View Implementation
~~~~~~~~~~~~~~~~~~~~~~~~~~

The LMS and Studio will each have one or two apps that implement view endpoints by extending a view provided by the Learning Core. These views will only respond to requests that come via the asset domains (i.e. they will not work if you request the same paths using the LMS or Studio domains).
Django is poorly suited to serving large static assets, particularly when deployed using WSGI. Instead of streaming the actual file data, the Django views serving assets will make use of the ``X-Accel-Redirect`` header. This header is supported by both Caddy and Nginx, and causes them to fetch the data from the specified URI and send it to the user. This redirect happens internally in the proxy and does *not* change the browser address. For sites using an object store like S3, the Django view will generate and send a signed URL for the asset. For sites using file-based Django media storage, the view will send a URL that Caddy or Nginx knows how to load from the file system.

The Django view will also be responsible for setting other important header information, such as size, content type, and caching information.
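Stripping Django away, the view's job reduces to building a small set of response headers and letting the proxy do the byte streaming. A sketch under stated assumptions (the helper name, internal URI scheme, and cache policy are all illustrative):

```python
def accel_redirect_headers(internal_uri: str, content_type: str, size: int) -> dict:
    """Build headers that hand the actual byte streaming off to the proxy.

    Illustrative sketch: `internal_uri` would be either a signed
    object-store URL or a filesystem location the proxy is configured
    to serve. The Django worker is freed immediately instead of
    streaming `size` bytes itself.
    """
    return {
        "X-Accel-Redirect": internal_uri,
        "Content-Type": content_type,
        "Content-Length": str(size),
        "Cache-Control": "private, max-age=3600",  # assumed policy, not decided in this ADR
    }

headers = accel_redirect_headers("/internal-assets/ab12cd34", "image/png", 48212)
```

Because the proxy performs the fetch internally, the browser only ever sees the stable, versioned asset URL, which is what keeps relative links between assets working.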
Permissions
~~~~~~~~~~~

The Learning Core provided view will contain the logic for looking up and serving assets, but it will be the responsibility of an app in Studio or the LMS to extend it with permissions checking logic. This logic may vary from app to app. For instance, Studio would likely implement a simple permissions checking model that only examines the learning context and restricts access to course staff. The LMS might eventually use a much more sophisticated model that looks at the individual Component that an asset belongs to.
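The extension point could look something like a template-method hook, where Learning Core owns the serving logic and each app supplies only the policy. Class and method names here are hypothetical, not the actual Learning Core API:

```python
from types import SimpleNamespace

class AssetView:
    """Sketch of a Learning Core base view (hypothetical names)."""

    def get(self, request, component_key, version, filepath):
        if not self.can_read(request.user, component_key, version):
            return 403  # stand-in for an HTTP 403 response
        return 200      # stand-in for the X-Accel-Redirect response

    def can_read(self, user, component_key, version):
        raise NotImplementedError("each app supplies its own policy")

class StudioAssetView(AssetView):
    """Studio policy sketch: learning-context level, course staff only."""

    def can_read(self, user, component_key, version):
        return getattr(user, "is_course_staff", False)

staff_req = SimpleNamespace(user=SimpleNamespace(is_course_staff=True))
assert StudioAssetView().get(staff_req, "lib:Axim:200", "draft", "fig1.png") == 200
```

An LMS subclass could later override ``can_read`` with per-Component checks (content groups, exam windows, and so on) without Learning Core knowing anything about those models.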
Cookie Authentication
~~~~~~~~~~~~~~~~~~~~~

Authentication will use a session cookie for each asset server domain.

Assets that are publicly readable will not require authentication.

Asset requests may return a 403 error if the user is logged in but not authorized to download the asset. They will return a 401 error for users who are not authenticated.
There will be a new endpoint exposed in LMS/Studio that will force a redirect and login to the asset server. Pages that make use of assets will be expected to load that endpoint in their ``<head>`` before any page assets are loaded. The flow goes like this:

#. There is a ``<script>`` tag that points to a new check-login endpoint in LMS/Studio, causing the browser to load and execute it before images are loaded.
#. This LMS/Studio endpoint generates a random token, stores user information in its backend cache keyed on that token, and redirects the user to an asset server login endpoint with that token as a querystring parameter.
#. The asset server endpoint checks the cache with that token for the relevant user information, logs that user in, and removes the cache entry. It has access to the cache because it's still proxying to the same LMS/Studio process underneath; it's just being called from a different domain.
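The steps above amount to a one-time-token handshake through a shared cache. A minimal sketch, with a dict standing in for the backend cache and all endpoint names and URL formats being illustrative assumptions:

```python
import secrets

CACHE = {}  # stand-in for the backend cache shared by both domains

def start_asset_login(user_id: str) -> str:
    """LMS/Studio endpoint (step 2): mint a token, stash the user."""
    token = secrets.token_urlsafe(32)
    CACHE[token] = user_id
    # Hypothetical asset-domain login URL; the format is illustrative.
    return f"https://lms.assets.sandbox.openedx.io/login?token={token}"

def complete_asset_login(token: str) -> str:
    """Asset-domain endpoint (step 3): redeem the token exactly once."""
    user_id = CACHE.pop(token)  # KeyError => invalid or already-used token
    # A real view would now set the asset-domain session cookie.
    return user_id

redirect_url = start_asset_login("user-42")
token = redirect_url.split("token=", 1)[1]
assert complete_asset_login(token) == "user-42"
assert token not in CACHE  # one-time use
```

Deleting the cache entry on redemption is what keeps the token single-use, so a leaked redirect URL cannot be replayed to log in a second time.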
Masquerading
~~~~~~~~~~~~

We could theoretically take masquerading into account during the auto-login process for the asset server, but we would not implement it in the first iteration.
Rejected Alternatives
---------------------

Per-asset Login Redirection
~~~~~~~~~~~~~~~~~~~~~~~~~~~

It is possible to initiate a series of redirects for every unauthenticated request to a non-public asset. This removes the need for pages using assets to include special handling in their ``<head>``. Some drawbacks of this approach:
* Injecting tokens into the querystrings of assets may cause errors or security leaks.
* Combining per-asset redirection with dedicated endpoints for the tokens would mean even more redirection, increasing the number of places where things could fail.
* There is a greater risk of bugs causing infinite redirect loops.
* A page that loads many assets concurrently may trigger a large set of duplicated redirects/logins.

Forcing the page to opt into asset authentication is unusual and may cause bugs. But the hope is that it is operationally safer and simpler, and that the number of views that directly render non-public assets will be relatively small.