-
Notifications
You must be signed in to change notification settings - Fork 266
Using Summingbird Scalding
When you write summingbird jobs, the preferred style is to put all your code into a method like:
def job[P <: Platform[P]](lines: P#Source[String], counts: P#Store[String, Long]): TailProducer[P, Any] = ...
If you do this, your job is totally general over any platform. Then on the platform object, there is a method to plan a TailProducer.
The inputs are one of four types: Source, Service, Store, Sink. The first two are inputs, the last two are outputs. Let's look at what these four types are for summingbird-scalding:
The scalding platform internally uses a Monad, called PipeFactory to represent the plan for given Producer. PipeFactory is really just: PipeFactory[T]
is approximately
(DateRange, Mode) => Try[(DateRange, (FlowDef, Mode) => TypedPipe[T])]
Which means when planning we get a desired date range of output, and we try to get a plan that can cover a subset of the desired range, and a function to apply to a (FlowDef, Mode)
in scalding to get a TypedPipe. This will likely change in the future to use the scalding Execution monad, which did not exist when this code was written.
Sources are basically functions (DateRange) => Mappable[T]
and are generally created with: sourceFromMappable.
Services support the leftJoin operation. We are testing internal services which are created from Stores in the same job. Well supported are external services that come from other summingbird jobs, some existing service log dumps, or some custom logic you see fit to write.
Commonly a join key is unique: there is only one item in all history that will have that join key. For instance: userID. Only one user will have a given ID. Nothing fancy needs to be done to support joins like this: just join against the entire table that you have for users. When you have this kind of join, see: UniqueKeyedService.from. If you have some custom join logic, you might look at: SimpleService. In this case you implement a way to compute what is satisfiable by the service at a given moment, and how to do the join using standard scalding.
The next service type you might consider is when you have an existing summingbird job and you want to serve some store that is aggregated by that job. For that, see BatchedDeltaService. To use BatchedDeltaService you also need to materialize the key/values just before the sumByKey in the other job using a .write
.
Looking to the future (and available in the develop branch and possibly a published version since we all know wikis get way out of date), StoreService wraps a BatchedStore (see the next section) in order to give you something that is both a Service and a Store which one part of the job can write to, and another can read from.
TODO. Basically see VersionedStore for the way to do this.
TODO. Just write the pipe, but, like with Monads.