great work, how to improve? #3

Open
dberardo-com opened this issue May 3, 2021 · 1 comment
Comments

@dberardo-com

Hi @johannestang, thanks for the quick-start repo you have made!

I am also thinking of setting up a big data stack for a Docker cluster, possibly using Helm charts for k8s.

Before I start testing with your stack, I would like to ask a couple of questions to understand which direction to take when making possible improvements:

  • First, what do you think should be improved first in this project? Do you have a roadmap in mind?
  • Second, if I understand correctly, the Hive service is only needed as a prerequisite for running SQL with Presto, right? Is it possible to drop Hive if we only want to use Presto / Impala, or is that not currently possible?
  • Third, in your blog post you state that "There are of course many other interesting big data SQL engines, e.g. Impala, Spark SQL, and Drill. For background on these (and more) have a look at this great post." Does that mean that if one wants to use Spark (and thus Spark SQL), one can use Hive and remove Presto from the stack, or would you still recommend connecting Spark SQL to Presto to run queries?

Thanks for the clarification.

P.S. Have you published any newer blog posts since 2019? Let me know.

@johannestang
Owner

Hi @dberardo-com

Thanks for reaching out. I'm afraid it's been a long time since I've worked on the components in this repo. Since putting together this stack I have also started using k8s, so that's probably the way I would go today. However, I'm not quite up to date with what's happened with the different projects in the stack, so there might be things that could/should be done differently today.

  1. Since the blog post (and no, I never got around to writing part 2), I worked on adding Kafka to the stack, using Kafka Connect to persist the data streams to Minio/S3 so that Presto could query historical data from S3 or live data from Kafka. I did, however, hit a roadblock with an incompatibility between Confluent's schema registry and Hive (which might have been fixed since). So while most of it worked, it never reached a state I was willing to make publicly available. But to answer your question, the next thing I would add is Kafka.

  2. I don't know - I at least didn't find a way of doing it. Things might be different today.

  3. Yes. If you were to use Spark then Presto could be removed.
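
For readers following up on point 3, here is a minimal sketch of what pointing Spark SQL at a Hive metastore and Minio/S3 storage might look like once Presto is dropped. It is not taken from this repo: the helper names `spark_hive_s3_conf` and `build_session`, and the hostnames, ports, and credentials in the usage block, are all hypothetical. The configuration keys themselves are standard Spark, Hive, and Hadoop S3A settings, but the right values depend on your deployment.

```python
from typing import Dict


def spark_hive_s3_conf(
    metastore_uri: str,
    s3_endpoint: str,
    access_key: str,
    secret_key: str,
) -> Dict[str, str]:
    """Return Spark settings for querying Hive-registered tables in an S3-compatible store.

    Hypothetical helper for illustration; not part of this repo.
    """
    return {
        # Use the Hive metastore as Spark's table catalog.
        "spark.sql.catalogImplementation": "hive",
        "spark.hadoop.hive.metastore.uris": metastore_uri,
        # S3A connector settings; path-style access is typically needed for Minio.
        "spark.hadoop.fs.s3a.endpoint": s3_endpoint,
        "spark.hadoop.fs.s3a.access.key": access_key,
        "spark.hadoop.fs.s3a.secret.key": secret_key,
        "spark.hadoop.fs.s3a.path.style.access": "true",
    }


def build_session(conf: Dict[str, str], app_name: str = "spark-sql-sketch"):
    """Create a SparkSession from the settings above (requires pyspark to be installed)."""
    # Imported lazily so the config helper can be used without Spark present.
    from pyspark.sql import SparkSession

    builder = SparkSession.builder.appName(app_name)
    for key, value in conf.items():
        builder = builder.config(key, value)
    return builder.enableHiveSupport().getOrCreate()


if __name__ == "__main__":
    # Assumed service names, ports, and credentials; replace with your own.
    conf = spark_hive_s3_conf(
        metastore_uri="thrift://hive-metastore:9083",
        s3_endpoint="http://minio:9000",
        access_key="minio-access-key",
        secret_key="minio-secret-key",
    )
    for key, value in sorted(conf.items()):
        print(f"{key}={value}")
```

With a session from `build_session(conf)`, tables registered in the metastore could then be queried with `spark.sql("SELECT ...")`. Note that in this sketch Hive's metastore stays in the stack; only Presto is replaced as the query engine.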
