Skip to content

Multilang protocol

nathanmarz edited this page Oct 15, 2011 · 8 revisions

Storm Multi-Language Protocol

The ShellBolt

Support for multiple languages is implemented via the ShellBolt class. This class implements the IBolt interfaces and implements the protocol for executing a script or program via the shell using Java's ProcessBuilder class.

Output fields

Output fields are part of the Thrift definition of the topology. This means that when you multilang in Java, you need to create a bolt that extends ShellBolt, implements IRichBolt, and declared the fields in declareOutputFields. You can learn more about this on [Concepts]

Protocol Preamble

A simple protocol is implemented via the STDIN and STDOUT of the executed script or program. A mix of simple strings and JSON encoded data are exchanged with the process making support possible for pretty much any language.

Packaging Your Stuff

To run a ShellBolt on a cluster, the scripts that are shelled out to must be in the resources/ directory within the jar submitted to the master.

However, During development or testing on a local machine, the resources directory just needs to be on the classpath.

The Protocol

Notes:

  • Both ends of this protocol use a line-reading mechanism, so be sure to trim off newlines from the input and to append them to your output.

  • All JSON inputs and outputs are terminated by a single line contained "end".

  • The bullet points below are written from the perspective of the script writer's STDIN and STDOUT.

  • Your script will be executed by the Bolt.

  • STDIN: A string representing a path. This is a PID directory. Your script should create an empty file named with it's pid in this directory. e.g. the PID is 1234, so an empty file named 1234 is created in the directory. This file lets the supervisor know the PID so it can shutdown the process later on.

  • STDOUT: Your PID. This is not JSON encoded, just a string. ShellBolt will log the PID to its log.

  • STDIN: (JSON) The Storm configuration. Various settings and properties.

  • STDIN: (JSON) The Topology context

  • The rest happens in a while(true) loop

  • STDIN: A tuple! This is a JSON encoded structure like this:

{
    // The tuple's id
	"id": -6955786537413359385,
	// The id of the component that created this tuple
	"comp": 1,
	// The id of the stream this tuple was emitted to
	"stream": 1,
	// The id of the task that created this tuple
	"task": 9,
	// All the values in this tuple
	"tuple": ["snow white and the seven dwarfs", "field2", 3]
}
  • STDOUT: The results of your bolt, JSON encoded. This can be a sequence of acks, fails, emits, and/or logs. Emits look like:
{
	"command": "emit",
	// The ids of the tuples this output tuples should be anchored to
	"anchors": [1231231, -234234234],
	// The id of the stream this tuple was emitted to. Leave this empty to emit to default stream.
	"stream": 1,
	// If doing an emit direct, indicate the task to sent the tuple to
	"task": 9,
	// All the values in this tuple
	"tuple": ["field1", 2, 3]
}

An ack looks like:

{
	"command": "ack",
	// the id of the tuple to ack
	"id": 123123
}

A fail looks like:

{
	"command": "fail",
	// the id of the tuple to fail
	"id": 123123
}

A "log" will log a message in the worker log. It looks like:

{
	"command": "log",
	// the message to log
	"msg": "hello world!"

}
  • STDOUT: emit "sync" as a single line by itself when the bolt has finished emitting/acking/failing and is ready for the next input

sync

Note: This command is not JSON encoded, it is sent as a simple string.

This lets the parent bolt know that the script has finished processing and is ready for another tuple.