Does JuliaDB support distributed filesystems like HDFS or AWS S3? #328

Open
schlichtanders opened this issue May 14, 2020 · 5 comments

@schlichtanders

Thanks for this awesome package!

It looks like JuliaDB can run across a distributed cluster of compute nodes and is not limited to a single-machine setup.

Having such a system, it would be natural to also load data in a distributed setting, e.g.

  • from all the different local filesystems
  • from HDFS directly
  • or from AWS S3 for instance

Take S3, for example: it would be really nice if we could load data from an S3 folder containing thousands of large Parquet files by downloading them onto separate processors, so that the data is spread over the cluster and the result is a single JuliaDB table that encompasses all of it.
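To make the idea concrete, here is a minimal sketch of only the partitioning step: splitting a list of object keys into one batch per worker, so that each worker would fetch and parse its own share. The helper `assign_files` and the keys are hypothetical and made up for illustration; nothing here is JuliaDB or S3 API, and the download and Parquet parsing are left out entirely.

```julia
# Hypothetical helper (not part of JuliaDB): distribute object keys
# round-robin into `nworkers` batches, one batch per worker.
function assign_files(object_keys::Vector{String}, nworkers::Int)
    nworkers > 0 || throw(ArgumentError("nworkers must be positive"))
    batches = [String[] for _ in 1:nworkers]
    for (i, key) in enumerate(object_keys)
        # mod1 maps 1, 2, 3, ... onto 1..nworkers cyclically
        push!(batches[mod1(i, nworkers)], key)
    end
    return batches
end

# Made-up keys, for illustration only.
object_keys = ["data/part-$(i).parquet" for i in 1:5]
batches = assign_files(object_keys, 2)
```

Each batch could then be handed to a worker, for instance with `pmap` from the `Distributed` standard library and a loader function of your own. Whether the per-worker results can be combined into one distributed JuliaDB table is exactly the open question here.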

For HDFS, imagine running JuliaDB on top of an AWS EMR cluster, where HDFS is already installed, and wanting to load an HDFS folder with thousands of large Parquet files. It would be awesome if JuliaDB could communicate with HDFS (e.g. via Elly.jl) and load the data from it in a distributed manner.

I guess the S3 case is the most flexible one: an intelligent distributed download into a distributed JuliaDB table.

Is something like this already possible with JuliaDB or on the roadmap?
Thanks a lot

@schlichtanders
Author

Is there any plan to support distributed file systems like HDFS or cloud storage (e.g. S3)?

Maybe this would come together with support for a more flexible loadtable function. Could that already be it?

@jpsamaroo
Collaborator

It'd be a welcome addition, but I don't think anyone is actively looking at adding such a feature to JuliaDB, so you'll probably need to implement it yourself.

@schlichtanders
Author

Thank you so much for your reply. For me, too, it is a question of time. I have put it on my list of possible next Julia projects.

Do you know whether there is a roadmap for JuliaDB? Like what is going to be implemented next?

@jpsamaroo
Collaborator

JuliaDB is basically in maintenance mode right now. If there's going to be a future for this package, it will be because the community decides to make it so. The original developers appear to have moved on to other things and are unlikely to commit to large refactorings or feature additions, but they would probably help with PR review.

@mahiki

mahiki commented Jan 23, 2022

I think it should be possible to use AWSS3.jl to provide abstract file paths into S3 for loading the data into the DB.

Though I haven't used JuliaDB yet, it seems like a great tool as a big data analytics engine. The serialized output format is the main limitation right now: until JuliaDB can write to a standard format like Arrow or Parquet, the DB artifacts should be considered a temporary workspace that gets regenerated from static files.

In this way, JuliaDB can be used as an (extremely inexpensive) MPP DB engine like Redshift Spectrum or Spark. The "DB" is the stored files, which so far cannot be written by JuliaDB. Yeah, now that I have written this down, this is a huge missing feature.
