Support data URIs in load_url #498

Open
soxofaan opened this issue Feb 28, 2024 · 3 comments

soxofaan commented Feb 28, 2024

load_url currently only supports HTTP(S) URLs:

  "name": "url",
  "description": "The URL to read from. Authentication details such as API keys or tokens may need to be included in the URL.",
  "schema": {
    "title": "URL",
    "type": "string",
    "format": "uri",
    "subtype": "uri",
    "pattern": "^https?://"
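
To make the restriction concrete, here is a minimal sketch in Python that checks URLs against the pattern from the excerpt above. The widened pattern `^(https?://|data:)` and the helper `is_accepted` are assumptions for illustration only; nothing here is taken from the specification beyond the current pattern.

```python
import re

# Pattern from the current `load_url` schema excerpt above.
CURRENT_PATTERN = re.compile(r"^https?://")
# Hypothetical widened pattern that would also admit data URLs (an assumption).
PROPOSED_PATTERN = re.compile(r"^(https?://|data:)")


def is_accepted(url: str, pattern: re.Pattern) -> bool:
    """Return True if `url` matches the given schema pattern."""
    return pattern.search(url) is not None
```

With the current pattern a `data:` URL is rejected at validation time, so some change to the schema would likely be needed in addition to any back-end support.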

While brainstorming a use case, we were looking for a way to avoid the overhead of creating and managing external URLs for a lot of small files. That led to the idea of loading from data URLs, where the data is embedded as base64 inside the process graph, with no need for external files or URLs. E.g.:

  "lu": {
    "process_id": "load_url",
    "arguments": {
      "url": "data:application/vnd.apache.parquet;base64,UEFSMRUEFRAVFEwVAhUAEgAACBwqAAAAAAAAAB...",

  • The additional work for a back-end would be small.
  • It would also be a handy feature in clients, allowing local data to be embedded directly in the process graph.
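
As a rough illustration of both points, the following Python sketch builds a base64 data URL on the client side and decodes it again as a back-end might. The helper names `to_data_url` and `from_data_url` are hypothetical, the payload is placeholder bytes rather than a real Parquet file, and only the `url` argument is shown, as in the example above.

```python
import base64
from typing import Tuple


def to_data_url(payload: bytes, media_type: str) -> str:
    """Encode raw bytes as a base64 data URL."""
    encoded = base64.b64encode(payload).decode("ascii")
    return f"data:{media_type};base64,{encoded}"


def from_data_url(url: str) -> Tuple[str, bytes]:
    """Split a base64 data URL into (media type, decoded bytes)."""
    if not url.startswith("data:"):
        raise ValueError("not a data URL")
    header, sep, encoded = url[len("data:"):].partition(",")
    if not sep or not header.endswith(";base64"):
        raise ValueError("only base64 data URLs are handled in this sketch")
    media_type = header[: -len(";base64")]
    return media_type, base64.b64decode(encoded, validate=True)


# Client side: embed local bytes in a process graph node.
payload = b"PAR1 placeholder bytes, not a real Parquet file"
node = {
    "process_id": "load_url",
    "arguments": {"url": to_data_url(payload, "application/vnd.apache.parquet")},
}

# Back-end side: recover the bytes before handing them to a file reader.
media_type, data = from_data_url(node["arguments"]["url"])
```

The sketch deliberately ignores non-base64 data URLs and extra media type parameters; a real implementation would have to decide how strict to be about those.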

soxofaan commented Apr 4, 2024

I can cook up a PR for this if there is more interest in this feature.

clausmichele commented

Interesting! This will probably carry the risk of creating heavy process graphs, but the same already happens with inline GeoJSON anyway. Good to have another option, and happy to try to support it.


soxofaan commented Apr 5, 2024

Indeed, we already had issues with users embedding huge GeoJSON constructs in their process graph, so this would not create a new problem. As a matter of fact, the textual representation of GeoJSON makes it very space-inefficient and data URLs could improve the situation because of binary encoding and compression.
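
A quick way to get a feel for the size argument is to compare GeoJSON text with a compressed, base64-encoded version of the same bytes. This is only a sketch under assumptions: the features are synthetic, and `zlib` stands in for whatever compression a binary format such as Parquet applies internally. It does not suggest that a back-end would accept zlib-compressed payloads.

```python
import base64
import json
import zlib

# Synthetic GeoJSON with many point features (made-up coordinates).
features = [
    {
        "type": "Feature",
        "properties": {"id": i},
        "geometry": {"type": "Point", "coordinates": [5.0 + i * 0.001, 51.0 + i * 0.001]},
    }
    for i in range(1000)
]
geojson_text = json.dumps({"type": "FeatureCollection", "features": features})

# Compress the same bytes, then base64-encode them as a data URL payload would be.
compressed = zlib.compress(geojson_text.encode("utf-8"))
embedded = base64.b64encode(compressed).decode("ascii")

# Compare the number of characters each variant would add to a process graph.
print(len(geojson_text), len(embedded))
```

Note that base64 itself inflates the payload by about a third (3 bytes become 4 characters), so the gain comes from the compression and the binary layout, not from the encoding.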

Still, it could be the responsibility of clients to set reasonable thresholds and to warn about or forbid excessive payloads.
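
A client-side guard along those lines could be sketched as follows. The function name `check_data_url_size` and both limits are hypothetical and chosen only for illustration; sensible values would depend on the back-end.

```python
import warnings

# Hypothetical limits, chosen only for illustration.
WARN_BYTES = 100_000
MAX_BYTES = 1_000_000


def check_data_url_size(url: str, warn_bytes: int = WARN_BYTES, max_bytes: int = MAX_BYTES) -> None:
    """Warn about large data URLs and reject excessive ones."""
    size = len(url.encode("utf-8"))
    if size > max_bytes:
        raise ValueError(f"data URL is {size} bytes, above the limit of {max_bytes}")
    if size > warn_bytes:
        warnings.warn(f"data URL is {size} bytes; consider hosting the file externally")
```

A client could call this right before adding the `load_url` node to the process graph, so the user gets feedback before anything is sent to the back-end.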
