Support for non-image files #283

Open
benjamingeer opened this issue Feb 6, 2019 · 19 comments

@benjamingeer
Contributor

benjamingeer commented Feb 6, 2019

I'm trying to implement support for uploading PDF and CSV documents for our friends in Lausanne. I'd like to upload a PDF file to Sipi and store it in a temporary directory, then move it to a permanent directory. I can't use the directory tmp under imgroot, because then when I try to load the file from tmp (using just a normal URL, not a IIIF URL), Sipi tries to redirect to info.json. I guess this is because everything under imgroot is assumed to be an image.

So I guess I need another tmp directory under server. But then I have another problem: the filename hashing only works under imgroot. Could it be made to work under server as well?

Also, I'm wondering whether we couldn't just use the operating system's /tmp directory, which benefits from some optimisations (e.g. on Linux I think it can be cached in memory). We could make /tmp/sipi/images, /tmp/sipi/server, etc. Would this be possible?

@subotic
Contributor

subotic commented Feb 6, 2019

We run Sipi inside a container. There is no system tmp folder per se that we can use. We would need to mount an external folder into tmp, but then it is the same thing as the others.

I would prefer to have a complete directory structure under a single folder as the default setting, e.g.,

assets
|- cache
|- images
|- server
|- tmp
|- whateverelse

so that only one folder needs to be mounted.
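
As a rough illustration of the single-root idea, the layout above could be created like this. This is only a sketch in Python, not part of Sipi; the function name `create_asset_tree` is hypothetical, and the subdirectory names are simply the ones listed above.

```python
from pathlib import Path
from typing import List

# Subdirectories named in the proposed layout (the "whateverelse" placeholder is omitted).
SUBDIRS = ("cache", "images", "server", "tmp")

def create_asset_tree(root: Path) -> List[Path]:
    """Create every subdirectory under one root so that only `root` has to be mounted."""
    created = []
    for name in SUBDIRS:
        directory = root / name
        # exist_ok=True makes the call safe to repeat on an already populated volume.
        directory.mkdir(parents=True, exist_ok=True)
        created.append(directory)
    return created
```

With this shape, a container would need a single volume mounted at the root path, and everything else lives below it.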

Also, for long-term preservation, we would need technical metadata. But this is a separate issue.

@benjamingeer
Contributor Author

@subotic OK, I guess I misunderstood. I thought you said there was already a Docker mount point for /tmp, and you were surprised when I said that Knora’s Sipi scripts don’t use it.

@subotic
Contributor

subotic commented Feb 6, 2019

On Travis we mount tmp, because it was needed for some tests. In production I forgot. I can always add it if necessary. Not hard to do. I would simply prefer if we could simplify it. Less stuff that can go wrong.

@loicjaouen
Collaborator

/tmp used to be needed for what was called the non-GUI upload case. Is that no longer the case?

@benjamingeer
Contributor Author

After discussion with @lrosenth and @subotic:

All content stored by Sipi will be under one directory, which could be called assets and will have project-specific directories under it, like this:

assets
   |- 0801
       |- A
       |- B
       |- C
          |- 1W6YRMj8VAT-GSQtJWgILX5.jp2
          |- 2pEDmjZo6X2-G8UovBGLixa.pdf
          |- 7phnClRcYeX-DxPQ7qgKZfA.csv
       |- D
       |- E
   |- 0803
       |- A
       |- B
       |- C
       |- D
       |- E
   |- tmp
       |- A
       |- B
       |- C
       |- D
       |- E
   |- cache

This makes it easier to move or back up a project's files, because there's just one directory of files per project. Sipi will determine the file type when the file is requested, and respond appropriately: if the base file URL is requested and the file isn't an image, Sipi will return the file instead of info.json.
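
The dispatch rule described above could be sketched roughly as follows. This is an illustration in Python, not Sipi's implementation: the function name `respond_to_base_url`, the two return labels, and the use of the file extension to guess the type are all assumptions made for the example.

```python
import mimetypes
from pathlib import PurePosixPath

# JPEG 2000 is not guaranteed to be in every platform's MIME table, so register it.
mimetypes.add_type("image/jp2", ".jp2")

def respond_to_base_url(asset_path: str) -> str:
    """Decide how a request for the base file URL should be answered.

    Returns "redirect-to-info.json" for images and "serve-file" for anything else.
    """
    mime_type, _ = mimetypes.guess_type(PurePosixPath(asset_path).name)
    if mime_type is not None and mime_type.startswith("image/"):
        return "redirect-to-info.json"
    # Non-images (and files of unknown type) are returned as-is.
    return "serve-file"
```

Under these assumptions, `assets/0801/C/1W6YRMj8VAT-GSQtJWgILX5.jp2` would be redirected to info.json, while the `.pdf` and `.csv` files in the same directory would be returned directly.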

@lrosenth expects to be able to do this next week.

@subotic
Contributor

subotic commented Feb 8, 2019

There is also the case of storing images and only serving them over non-IIIF URLs, e.g., icons, watermarks, etc. Could this case also be covered?

@benjamingeer
Contributor Author

> storing images and only serving them over non-IIIF URLs, e.g., icons, watermarks, etc.

Couldn't these be served over IIIF URLs, too?

@subotic
Contributor

subotic commented Feb 8, 2019

For icons and such, I guess so. @kilchenmann will know for sure. We only need to make sure that all assets that are served through sipi are referenced somewhere in webapi (e.g., in the project), and only accessed by URLs provided by webapi. A client shouldn't be allowed to upload things to sipi without webapi knowing about it (eventually).

The only special case is the watermark image. I think it needs to be an absolute path to the image on disk.

@kilchenmann
Contributor

For icons (for resource classes) we want to use an existing library, and we store only the name of the icon. So we don't need to upload an image there. But for a project logo we should use the IIIF URL.

@benjamingeer changed the title from "Support for subdirectories outside imgroot" to "Support for non-image files" on Jul 5, 2019
@mrivoal

mrivoal commented Jul 17, 2019

Re-posting my question here:

As discussed in the last developer meeting, we will need Knora and Sipi to handle PDFs very soon. From what I understood from @lrosenth, it doesn't involve a lot of work on Sipi...

If we validate/convert the PDF/A ourselves, when could we expect this support for non-image files to be ready?

@lrosenth
Collaborator

lrosenth commented Jul 17, 2019 via email

@benjamingeer
Contributor Author

I’ll be on holiday during the first two weeks of August, and will be able to work on the Knora side of this when I get back.

@benjamingeer
Contributor Author

Don’t forget we still need support for text files.

@benjamingeer
Contributor Author

With the current Sipi, the /knora.json route doesn't return originalFilename or originalMimeType for a PDF file. This means that we have to make these properties optional for file values in knora-base. Is that what we want to do?

@subotic added this to the Backlog milestone on Feb 7, 2020
@lrosenth
Collaborator

lrosenth commented Jul 2, 2020

There is no way to store the original filename and MIME type within a PDF header, since PDFs are treated as "blobs" when uploading. However, knora.json now returns the internal name and internal MIME type (which is the same as the original, since a PDF is not modified by SIPI) as originalFilename and originalMimetype.
This could be a problem if an upload script changes the original name, but it does not break knora-base. The only workaround would be to use sidecar files, but I consider this problematic...
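
The fallback described here might look something like the following Python sketch. It is not the actual Sipi code: the function name, the `embedded_*` parameters (standing in for metadata that some formats may carry in their header), and the `internalMimetype` key are assumptions for illustration. Only `originalFilename`, `originalMimetype`, and `internalFilename` appear in this thread.

```python
from typing import Optional

def knora_json_fields(internal_filename: str,
                      internal_mimetype: str,
                      embedded_original_filename: Optional[str] = None,
                      embedded_original_mimetype: Optional[str] = None) -> dict:
    """Build the filename and MIME type fields of a knora.json-style response.

    For "blob" files such as PDFs nothing is embedded, so the internal values
    are reported as the original ones.
    """
    return {
        "internalFilename": internal_filename,
        "internalMimetype": internal_mimetype,  # hypothetical key name
        "originalFilename": embedded_original_filename or internal_filename,
        "originalMimetype": embedded_original_mimetype or internal_mimetype,
    }
```

The weakness mentioned above is visible in the sketch: if the upload script has already renamed the file, the reported `originalFilename` is the renamed one, because nothing else is available.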

@subotic
Contributor

subotic commented Jul 3, 2020

In the future, we will need to find a way to store these kinds of information. Just based on my gut feeling, I think that this would be the job of dsp-api. For me, sipi is a media server and shouldn't be responsible for storing preservation metadata. This should be the job of dsp-api. So, everything that would go into a sidecar file, would go into dsp-api.

@benjamingeer
Contributor Author

Knora already stores originalFilename and originalMimetype if Sipi provides that information.

@subotic
Contributor

subotic commented Jul 3, 2020

So for PDFs, sipi doesn't/cannot record the original filename. What if someone wants to upload different PDFs under the same name? Shouldn't every uploaded file get a unique name?

@benjamingeer
Contributor Author

Knora's upload script always makes a random internalFilename for the uploaded file. The originalFilename is only remembered as part of the FileValue object in the triplestore.
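
To illustrate the idea (this is a Python sketch, not Knora's actual upload script), a random internal name can be generated while the original name is kept only in the record that ends up in the FileValue. The function name `make_file_value` and the 16-byte token length are assumptions.

```python
import secrets
from pathlib import PurePath

def make_file_value(original_filename: str) -> dict:
    """Pair a random internal filename with the user's original filename."""
    # Keep the extension so the file type stays recognisable on disk.
    suffix = PurePath(original_filename).suffix.lower()
    internal_filename = secrets.token_urlsafe(16) + suffix
    return {
        "internalFilename": internal_filename,
        "originalFilename": original_filename,
    }
```

Two uploads of different PDFs that are both called `report.pdf` would then get different `internalFilename` values, so they cannot overwrite each other on disk.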
