-
Notifications
You must be signed in to change notification settings - Fork 6
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
testing data migrations of editable notes feature #396
Comments
of 131 notes that lack markdown, 57 were found statically hosted, downloaded, and saved into the database. The other 74 failed to be found or converted. It's worth checking some of these failures to see if they can be found on the beta S3 but are not loaded due to encoding errors in the name, or if they really are missing. |
Many slugs reported missing do not appear to be in the note html upload. Such is the staging system, a wash of random things that come and go and incoherence. The slug If there is static content, should that override the markdown? |
For confirmation, I looked for how many rows in note_markdown that had non-zero/non-null HTML, but zero/null markdown. There were 57. That lines up with the number of static files that were downloaded. I also looked at the length of the HTML, and it was ... nothing I would not expect. It'll be hard to confirm this without merging into the largely complete edit branch. |
The HTML coming from the static hosted notes should be sanitized using @yourcelf 's sanitization code prior to being inserted into the database. Need to update the data migration for that. |
This will be a good comparison for downloading from static. Unfortunately #397 is preventing me from loading the URL at the moment. |
A quick note: if the |
@yourcelf South's ORM is in full effect. I might be able to import the models and call the correct functions in clever and tricky ways. So long as they don't touch the database, everything should work as expected. |
Rather than get too fancy with importing I'd just copy/paste code in this case. The relevant parts is rather simple:
(will need to copy out definition of Django does so much magic in loading the models that I would be leery of importing them within South without consequences. Migrations are also, at least in theory, supposed to be a historical record of changes -- and if we change the |
I was meaning to call the code statically, but if the code is that simple, I agree copy/paste is better. |
*statically as in unbound methods being shoehorned into other objects as bound methods. very hackery. |
Testing against http://localhost:8080/note/harvard/cs-50-114/1st-lecture using the latest commit off the There is html is in the
On beta's system, the link looks like this: On my test system, the note appears blank. Browser complains that Ah, I bet I'm using old static content! I need to |
There seems to be a disconnect between where |
I'm using beta's info, which is good because I'm basically testing what deployment will look like on the staging system. Link serving the static file: I'll have to check two things:
|
One thing to try: ask django to clear its cache. This ought to do that:
|
I thought maybe it was just a delay from S3 to cloudfront, but the files are still not updated. Trying to clear the cache.
and I'll try another |
It probably wasn't the django cache, since that doesn't get in between cloudfront.net requests and the browser. I'm looking at S3 and I see Time to learn about how CloudFront CDN works. |
CloudFront for beta has the correct S3 bucket as its origin, but it was last modified April 2014. I don't see any big button to pull from the origin. Problem is nailed down at least. Definitely need to learn this for updating prod later. |
Ah, the issue is with how caching is performed across the CDN. The assumption is that each file is unique and independent ... and permanent. In order to upload "new" files, you're supposed to upload a file with a unique name. Versioning file names is highly recommended. According to this SO, there is an invalidation method. Need to POST an invalidation request to invalidate existing files. Unsure if they are then automagically re-discovered, or if there is a second step. |
(The reason the cache might be implicated: though it doesn't get between the browser and CDN, compress uses the cache to decide which file to retrieve from the CDN. Cache clearing might be needed if the symptom is that the CDN receives the new files (named by the correct new md5 hash), but the page still renders requesting the old files (named by the old md5 hash)) |
Watching the network transfer logs, the file name is "note-detail.js". In fact, every single file appears to be downloaded individually. In fact, on the s3 bucket, where I don't see any effect of compress in action (hence my ticket about it in #397) |
Got it. Looks like compress is running with This could be either because |
Dev boxes, which are designed to be very, very different from staging and production, do not use However, staging, production, and my VM are all running |
Just for confirmation, beta has |
Moving compress conversation back to ticket #397 I could try my hand at file invalidation in the meantime, but it will be a painstaking and manual process that cannot be automated in the future. |
While testing ...
<div class="row">
<div class="small-12 small-centered columns medium-12 large-12 body_copy">
<div id="note-content-wrapper" class="note-text">
<div class='note-html' id='note-html'>
<!DOCTYPE html>
<!-- Created by pdf2htmlEX (https://github.com/coolwanglu/pdf2htmlex) -->
<html>
<head>
<meta charset="utf-8">
<meta name="generator" content="pdf2htmlEX">
<meta http-equiv="X-UA-Compatible" content="IE=edge,chrome=1">
<style type="text/css">
/*!
* Base CSS for pdf2htmlEX
* Copyright 2012,2013 Lu Wang <[email protected]>
* https://github.com/coolwanglu/pdf2htmlEX/blob/master/share/LICENSE
*/#sidebar{position:absolute;top:0;left:0;bottom:0;width:250px;padding:0;margin:0;overflow:auto}#page-container{position:absolute;top:0;left:0;margin:0;padding:0;border:0}@media screen{#sidebar.opened+#page-container{left:250px}#page-container{bottom:0;right:0;overflow:auto}.loading-indicator{display:none}.loading-indicator.active{display:block;position:absolute;width:64px;height:64px;top:50%;left:50%;margin-top:-32px;margin-left:-32px}.loading-indicator img{position:absolute;top:0;left:0;bottom:0;right:0}}@media print{@page{margin:0}html{margin:0}body{margin:0;-webkit-print-color-adjust:exact}#sidebar{display:none}#page-container{width:auto;height:auto;overflow:visible;background-color:transparent}.d{display:none}}.pf{position:relative;background-color:white;overflow:hidden;margin:0;border:0}.pc{position:absolute;border:0;padding:0;margin:0;top:0;left:0;width:100%;height:100%;overflow:hidden;display:block;transform-origin:0 0;-ms-transform-origin:0 0;-webkit-transform-origin:0 0}.pc.opened{display:block}.bf{position:absolute;border:0;margin:0;top:0;bottom:0;width:100%;height:100%;-ms-user-select:none;-moz-user-select:none;-webkit-user-select:none;user-select:none}.bi{position:absolute;border:0;margin:0;-ms-user-select:none;-moz-user-select:none;-webkit-user-select:none;user-select:none}@media print{.pf{margin:0;box-shadow:none;page-break-after:always;page-break-inside:avoid}@-moz-document url-prefix(){.pf{overflow:visible;border:1px solid #fff}.pc{overflow:visible}}}.c{position:absolute;border:0;padding:0;margin:0;overflow:hidden;display:block}.t{position:absolute;white-space:pre;font-size:1px;transform-origin:0 100%;-ms-transform-origin:0 100%;-webkit-transform-origin:0 100%;unicode-bidi:bidi-override;-moz-font-feature-settings:"liga" 0}.@CSS_LINK_CN span{position:relative;vertical-align:baseline;display:inline-block;unicode-bidi:bidi-override}._{color:transparent;z-index:-1}::selection{background:rgba(127,255,255,0.4)}::-moz-selection{background:rgba(127,255,255,0.4)}.pi{display:none}.d{position:absolute;transform-origin:0 100%;-ms-transform-origin:0 100%;-webkit-transform-origin:0 100%}</style>
<style type="text/css">
... |
@yourcelf I wasn't sure when to do the PDF stuff, before or after the preserve formatting, or if preserve formatting does not run on PDFs at all. Here's the path I took (constants are copy/pasted out of Note). # clean the html in a consistent way with note uploads as of the
# time of this migration.
if note.mimetype in PDF_MIMETYPES:
# handle embedded images from pdf2htmlEX
html = sanitizer.data_uris_to_s3(html)
if note.category in EDITABLE_CATEGORIES:
# make HTML editable
html = sanitizer.sanitize_html_to_editable(html)
else:
# clean up HTML without concern for editing
html = sanitizer.sanitize_html_preserve_formatting(html) |
The end goal is to have every note model have an associated The main pipelines are currently:
We're talking here about a 3rd, new pipeline, which is "migrate all old notes with static-hosted HTML to inlineable database-hosted HTML." So the decision pipeline for me would go something like this:
|
Interesting stats from the migration of the staging static hosted notes:
Either something is wrong with
I'm curious what the breakdown of mimetypes is for the failed notes. I should probably find out. |
For some reason I assumed PDF notes would not be categorized as (editable) lecture notes, but in fact, 5 were. I'm not sure what to do about this situation. The output from pdf2htmlEX is not friendly for editing. @yourcelf , @AndrewMagliozzi Any opinion on this? Should we change the category of "Lecture Notes" which were PDFs to something non-editable like study guide or other? Should we create a new category like "Lecture Notes (readonly)" ? |
|
@yourcelf oh, so we can make pdf2htmlEX output editable? I misunderstood from our conversation about it. That simplifies the issue considerably! If I understand correctly, Does the order of filtering matter? I'm still not clear if I do branch on |
Yes, all of The advantage of putting it in Regarding ordering: if I remember right, both flavors of |
@yourcelf Ah, good to know about idempotence and ordering. In that case, I think my code is right with a small change. I'll run Great! |
Testing It's there. No javascript errors. Doesn't even look too bad. Yikes, ton of CORS errors blocking static elements along these lines:
Need to find an editable migrated note (of the 5). |
Of the 5 editable notes on the staging system which were migrated from static notes, 4 are the same document. So two unique documents.
They are there. They did not translate very well, but that is to be expected. All the content does appear in place, and there are some images. No console errors about CORS, but there is this error The WYSIWYG editor is accessible, but it is showing raw HTML. There is extraneous spacing and empty tags. Removing some of this extraneous stuff works fine. Trying to add a bold tag to some plain text causes no particular errors on the console, but seems to reload the page rather than changing the text. This is true of both texts. |
Here's a quick view of the production database stats. Of 250 editable notes (category: Lecture Notes), only 6 are
|
Same call on staging just to see how the distributions align. Looks like each type seen on production exists somewhere on staging. It'll be worth checking how these all react to being downloaded from static to consider whether the original filepicker or gdrive file should be downloaded instead, or if it works well enough as-is.
|
|
Managed to pick editable versions of everything but Both uhm so I guess this ticket has migrated into testing the note editing data migrations, not just the static notes. |
Reverting the static files and code for the staging system so I can test the images to see how it handles them. http://karmanotes-beta.herokuapp.com/note/harvard/reason-and-faith-in-the-west/imagejpg-4-23-317946 shows an image (and apparently displays some poor OCR). The correct URL for the image should be https://lh4.googleusercontent.com/m7DjSVOLh6Si5TlMQoz3O64-j2pB5CFIRKgniUwBkrR-PgIym3LwlEmmUZeafDNyLJSkuOSWgwIRdZsCiszmgUICC3tJu_KyQhtHa2NoT_197hGQCBEXJtkDVVPe008M but it is extremely truncated in the new code base version, leading to a bad file link. |
In the markdown version in the database, the URL looks like this: ![](https://lh4.googleusercontent.com/m7DjSVOLh6Si5TlMQoz3O64
-j2pB5CFIRKgniUwBkrR-PgIym3LwlEmmUZeafDNyLJSkuOSWgwIRdZsCiszmgUICC3tJu_KyQhtHa
2NoT_197hGQCBEXJtkDVVPe008M)
* * *
FICTION there is a line break in the URL which seems to align with the broken image link. |
Ah the line break doesn't bother markdown, but the conversion from markdown to HTML drops everything after the line break. <p><img src="https://lh4.googleusercontent.com/m7DjSVOLh6Si5TlMQoz3O64"></p>
<hr>
<p>FICTION </p> |
Forked line break issue to #404 . Data migrations are not yet ready for staging or production. |
Retesting all links above.
So everything looks good. Except the pdf link. That spawns #406 . Since there is a true overlap between uploaded PDFs and Lecture Notes (which are editable), we need to ensure pdf2htmlex output can be edited with wysihtml. Right now something awkward is happening. |
Alright, it looks like the only stopper to finishing this ticket off was actually bug related to FF and our implementation of WYSIHTML, and nothing at all to do with migrations. I think the data migrations look good so far. I'm ready to push this all to the staging system beta for testing by live users. However, there might be problems with users running FF when editing notes. |
woops no reason to keep this ticket open. data migrations look good. any further problems found by live users or whatever will be new tickets. |
data migration is written in commit 67834da but as yet untested.
needs to be tested against beta's database.
The text was updated successfully, but these errors were encountered: