great work, how to improve? #3

dberardo-com · 2021-05-03T09:47:01Z

Hi @johannestang, thanks for the quick-start repo you have made!

I am also thinking of setting up a big data stack for a Docker cluster, possibly using Helm charts for k8s.

Before I start testing with your stack, I would like to ask a couple of questions to understand which direction to take when making possible improvements:

First off, what do you think should be improved first in this project? Do you have a desired roadmap?
Secondly, if I understand correctly, the Hive service is only needed as a requirement for using Presto to run SQL, right? Is it possible to get rid of Hive if we only want to use Presto / Impala, or is that not currently possible?
Thirdly, in your blog post you state that "There are of course many other interesting big data SQL engines, e.g. Impala, Spark SQL, and Drill. For background on these (and more) have a look at this great post." Does this mean that if one wants to use Spark (and thus Spark SQL), one can use Hive and remove Presto from the stack, or would you still recommend connecting Spark SQL to Presto to run queries?

Thanks for the clarification.

P.S. Have you published any newer blog posts since 2019? Let me know.

johannestang · 2021-05-25T14:03:11Z

Hi @dberardo-com

Thanks for reaching out. I'm afraid it's been a long time since I worked on the components in this repo. Since putting together this stack I have also started using k8s, so that's probably the way I would go today. However, I'm not quite up to date with what's happened with the different projects in the stack, so there might be things that could or should be done differently today.

Since the blog post (and no, I never got around to writing part 2), I worked on adding Kafka to the stack, using Kafka Connect to persist the data streams to Minio/S3 so that Presto could query historical data from S3 or live data from Kafka. However, I hit a roadblock with an incompatibility between Confluent's schema registry and Hive (which might have been fixed since). So while most of it worked, it never reached a state I was willing to make publicly available. But to answer your question, the next thing I would add is Kafka.
As for removing Hive: I don't know. I at least didn't find a way of doing it. Things might be different today.
Yes. If you were to use Spark, then Presto could be removed.
