Does JuliaDB support distributed filesystems like HDFS or AWS S3? #328

schlichtanders · 2020-05-14T14:52:43Z

Thanks for this awesome package!

It looks like JuliaDB can be put to work across a distributed cluster of compute units and is not limited to a single-machine setup.

Given such a system, it would be natural to also load data in a distributed setting, e.g.

from all the different local filesystems,
from HDFS directly,
or from AWS S3, for instance.

Take S3, for example: it would be really nice if we could load data from an S3 folder containing thousands of large Parquet files by downloading them onto separate processors, so that the data is spread over the cluster and we end up with a single JuliaDB table that encompasses all of it.
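As a rough illustration of the "download onto separate processors" idea, here is a minimal sketch of the first step only: splitting a list of object keys into one batch per worker. It assumes the keys have already been listed somehow. The function name `partition_keys` and the example key names are hypothetical, and nothing here calls JuliaDB or S3.

```julia
# Hypothetical sketch: assign object keys to workers round-robin.
# `partition_keys` and the key names below are made up for illustration;
# this does not touch S3, HDFS, or JuliaDB.

function partition_keys(object_keys::Vector{String}, nworkers::Int)
    nworkers > 0 || throw(ArgumentError("nworkers must be positive"))
    # One (initially empty) batch of keys per worker.
    batches = [String[] for _ in 1:nworkers]
    for (i, key) in enumerate(object_keys)
        # mod1 maps 1, 2, 3, 4, ... onto 1, 2, ..., nworkers, 1, ...
        push!(batches[mod1(i, nworkers)], key)
    end
    return batches
end

# Ten made-up Parquet keys spread over three workers.
object_keys = ["data/part-$(lpad(i, 4, '0')).parquet" for i in 1:10]
batches = partition_keys(object_keys, 3)
```

Each worker would then download and load only its own batch. How the resulting partitions get combined into one distributed JuliaDB table is the part this thread asks about and is not covered by the sketch.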

With HDFS, one can imagine running JuliaDB on top of an AWS EMR cluster, where HDFS is already installed, and wanting to load an HDFS folder with thousands of large Parquet files. It would be awesome if JuliaDB could communicate with HDFS (e.g. via Elly.jl) and load the data from it in a distributed manner.

I guess the S3 example is the most flexible one: an intelligent distributed download straight into a distributed JuliaDB table.

Is something like this already possible with JuliaDB or on the roadmap?
Thanks a lot

schlichtanders · 2021-03-24T08:47:18Z

Is there any plan to support distributed file systems like HDFS, or cloud storage (e.g. S3)?

Maybe this comes together with supporting a more flexible loadtable function. Could this already be it?

jpsamaroo · 2021-03-24T12:40:08Z

It'd be a welcome addition, but I don't think anyone is actively looking at adding such a feature to JuliaDB, so you'll probably need to implement it yourself.

schlichtanders · 2021-03-24T13:40:29Z

Thank you so much for your reply. For me, too, it is a question of time. I have put it on my list of possible next Julia projects.

Do you know whether there is a roadmap for JuliaDB? Like what is going to be implemented next?

jpsamaroo · 2021-03-24T14:37:10Z

JuliaDB is basically in maintenance mode right now. If there's going to be a future for this package, it will be because the community decides to make it so. The original developers appear to have moved on to other things, and are probably not likely to commit to large refactorings and feature additions, but they would probably help with PR review.

mahiki · 2022-01-23T05:20:47Z

I think it should be possible to use AWSS3.jl to provide abstract file paths into S3 for loading the data into the DB.

Though I haven't used JuliaDB yet, it seems like a great tool as a big data analytics engine. The serialized output format is the main limitation right now: until JuliaDB can write to a standard format like Arrow or Parquet, the DB artifacts should be considered a temporary workspace that gets regenerated from static files.

In this way JuliaDB can be used as an (extremely inexpensive) MPP DB engine like Redshift Spectrum or Spark. The "DB" is the stored files, which so far cannot be written by JuliaDB. Yeah, now that I have written this down, this is a huge missing feature.
