Welcome to the setup guide for Kafka Salsa. This guide covers:
- Installation: Building and running Kafka Salsa.
- Specify the Application: Choosing and configuring an implementation approach.
- Deploy: Run Kafka Salsa locally or deploy to Kubernetes.
- Load Data: Ingest data into your Kafka cluster.
- REST API: Query recommendations or the graph store directly over REST.
- Clone the repository: `git clone git@github.com:torbsto/kafka-salsa.git`
- Install Apache Maven.
- Navigate into this repository: `cd ./kafka-salsa`
- Build the project with Maven: `mvn package`
- (Optional for local development) Install Docker and start a local Kafka cluster: `cd ./dev/ && docker-compose up`
- Run Kafka Salsa: `java -jar target/kafka-salsa.jar ...`
- Note that Kafka Salsa contains four different implementation approaches. You specify which approach to use via the command line; the next section covers how to run each of them.
Kafka Salsa implements four approaches to store and query the user-tweet-interaction graph. You must specify which approach to use by choosing a Kafka Streams Processor on startup.
| Command | Description |
|---|---|
| `range-key` | RangeKey Edge Processor |
| `sampling` | Sampling Edge Processor |
| `segmented` | Edge processor with GraphJet-like engine |
| `simple` | Simple Edge Processor |
The command is simply the first parameter of the JAR call shown above:
java -jar target/kafka-salsa.jar simple ...
All approaches share a set of parameters that can be specified at startup. All required parameters must be provided:
| Parameter | Required | Description | Default |
|---|---|---|---|
| `--application-id` | yes | Name of the streams application | - |
| `--host` | yes | Host address of the REST service | - |
| `--port` | no | Port of the REST service | 8070 |
| `--brokers` | yes | Address and port of the Kafka brokers | - |
| `--schema-registry-url` | yes | Address and port of the schema registry | - |
| `--topic` | no | Name of the input topic | edges |
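For example, starting the `simple` processor with all required parameters could look like this (all values are placeholders for a local setup):

```bash
# Placeholder values for a local setup; adjust them to your environment.
java -jar target/kafka-salsa.jar simple \
  --application-id=kafka-salsa \
  --host=localhost \
  --brokers=localhost:29092 \
  --schema-registry-url=http://localhost:8081
```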
Some approaches support additional parameters. They are initialized with sensible defaults but can be adjusted for your use case:
Sampling Edge Processor:

| Parameter | Required | Description | Default |
|---|---|---|---|
| `--buffer` | no | Size of the buffer used for sampling edges | 5000 |
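For example, to increase the sample buffer (following the document's convention, `...` stands for the shared parameters described above):

```bash
# "..." stands for the shared parameters; --buffer is specific to the sampling processor.
java -jar target/kafka-salsa.jar sampling ... --buffer=10000
```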
Segmented Edge Processor:

| Parameter | Required | Description | Default |
|---|---|---|---|
| `--segments` | no | Number of segments in the GraphJet-like index | 10 |
| `--pools` | no | Number of pools per segment | 16 |
| `--nodesPerPool` | no | Number of nodes per pool | 131072 |
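For example, to adjust the index layout of the segmented processor (the values shown are placeholders, `...` again stands for the shared parameters):

```bash
# "..." stands for the shared parameters; the index layout values are placeholders.
java -jar target/kafka-salsa.jar segmented ... --segments=20 --pools=32 --nodesPerPool=262144
```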
We provide a docker-compose setup for local development and testing purposes. It contains the services ZooKeeper, Kafka, and Confluent's Schema Registry. Execute `docker-compose up` in the `./dev/` directory to start the services.
To run Kafka Salsa against the local Docker setup, execute the following command:
java -jar target/kafka-salsa.jar simple --host=localhost --brokers=localhost:29092 --schema-registry-url=http://localhost:8081
We also provide a bash script that deploys our full Kafka Salsa setup to Microsoft Azure using Kubernetes.
sudo chmod -R +x ./kubernetes/
cd ./kubernetes/
./run.sh
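After the script completes, you can check the deployment with standard kubectl commands, assuming `kubectl` is configured for the Azure cluster created by the script:

```bash
# List the pods started by the deployment.
kubectl get pods
```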
We provide two Kafka Producers that help ingest data into your Kafka cluster (local or remote). Both are located in the `de.hpi.msd.salsa.producer` package. The `MockDataProducer.java` creates random data in a fixed time interval, and the `CsvDataProducer.java` can ingest CSV data into a topic. To ingest our evaluation dataset from twitter-dataset, use the `CsvDataProducer.java`.
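How the producers are started depends on their `main` methods, which are not documented here; the invocation below is only a sketch under the assumption that `CsvDataProducer` can be launched directly from the packaged JAR:

```bash
# Assumption: CsvDataProducer exposes a main method reachable from the packaged JAR;
# the expected arguments (e.g. the path to the CSV file and the target topic) may differ.
java -cp target/kafka-salsa.jar de.hpi.msd.salsa.producer.CsvDataProducer
```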
The REST API consists of a recommendation service and an adjacency query service. The responses are in JSON.
Get the top n recommendations for a user:
GET http://localhost:8070/recommendation/salsa/userId?limit=n
Specify the length and number of random SALSA walks for the recommendation:
GET http://localhost:8070/recommendation/salsa/userId?limit=n&walks=10&walk_length=100
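For example, against the local setup and with a placeholder user ID (the ID must exist in the ingested data):

```bash
# Top 10 recommendations for user 12 (placeholder ID), using 10 walks of length 100.
curl "http://localhost:8070/recommendation/salsa/12?limit=10&walks=10&walk_length=100"
```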
You can query the graph store directly, which is useful for debugging.
Get the degree of a node:
GET http://localhost:8070/state/[left|right]Node/id/degree
Get the neighbors of a node:
GET http://localhost:8070/state/[left|right]Node/id/neighborhood
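For example, using `leftNode` from the pattern above and a placeholder node ID:

```bash
# Degree and neighborhood of left node 12 (placeholder ID).
curl "http://localhost:8070/state/leftNode/12/degree"
curl "http://localhost:8070/state/leftNode/12/neighborhood"
```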