
BOSS 2019 Tutorial for Cloudberry

Welcome to the Cloudberry tutorial at BOSS 2019 co-located with VLDB 2019!

I) Introduction

In this tutorial, you will go through the following steps:

  • Get your machine
    • We provide each of you an AWS EC2 machine. You'll register your name to claim one machine and keep its IP address for later use.
  • Start Cloudberry and Twittermap
    • Start your own Cloudberry middleware and Twittermap application on the Amazon AWS EC2 machine that we provide.
    • Play with the Twittermap application to enjoy the power of data visualization.
  • Try examples querying Cloudberry through its RESTful API
    • Run real queries against Cloudberry to learn how it accelerates your application queries through its RESTful API.
    • Enjoy the power of Cloudberry to improve the user experience.

The following diagram illustrates the Twittermap architecture:

II) Get your AWS EC2 machine's IP address

Open the following Google Spreadsheet:

Cloudberry BOSS 2019 AWS EC2 IPs

Register your name next to an IP address.

Important!

Suppose the IP address you registered is

ec2-35-163-10-157.us-west-2.compute.amazonaws.com

We will call it [your aws ip]. In the rest of this tutorial, [your aws ip] refers to the address you registered.
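For example, using the sample address above, the SSH command in the next step would look like this (substitute the address you actually registered):

ssh boss@ec2-35-163-10-157.us-west-2.compute.amazonaws.com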

Open a terminal connecting to your AWS machine

Open a terminal on your own laptop.

Run the following command:

ssh boss@[your aws ip]

Type password: boss

If you see the following message, you're in! ^^

Last login: Wed Aug 21 17:31:55 2019 from dhcp-v097-215.mobile.uci.edu

       __|  __|_  )
       _|  (     /   Amazon Linux 2 AMI
      ___|\___|___|

https://aws.amazon.com/amazon-linux-2/
4 package(s) needed for security, out of 12 available
Run "sudo yum update" to apply all updates.
[boss@ip-172-31-29-92 ~]$ 

III) System settings

  • 8GB memory
  • 2 vCPUs
  • OS: Amazon Linux
  • Installed Software:
    • Java 8 SDK
sbt 0.13 (Scala build tool)
    • AsterixDB
  • Data:
    • Sample Tweets (1M): 2018-01-01 ~ 2019-08-24

IV) Start Cloudberry and Twittermap

Note: the backend database (AsterixDB) is already installed and started for you. The AsterixDB Web Console is at http://[your aws ip]:19001
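As an optional sanity check from your laptop, you can confirm that the console answers HTTP requests; this is a minimal check, assuming curl is installed on your laptop:

curl -s -o /dev/null -w '%{http_code}\n' http://[your aws ip]:19001
# Prints the HTTP status code; 200 means the AsterixDB web console is reachable.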

1. Start Cloudberry

1.1 Go to ~/cloudberry/cloudberry directory

cd ~/cloudberry/cloudberry

1.2 Start Cloudberry

sbt "project neo" "run"

Wait until you see the following messages:

[info] Loading project definition from /home/cloudberry/cloudberry/cloudberry/project
[info] Set current project to cloudberry (in build file:/home/cloudberry/cloudberry/cloudberry/)
[info] Set current project to neo (in build file:/home/cloudberry/cloudberry/cloudberry/)

--- (Running the application, auto-reloading is enabled) ---

[info] p.c.s.NettyServer - Listening for HTTP on /0:0:0:0:0:0:0:0:9000

(Server started, use Ctrl+D to stop and go back to the console...)
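Optionally, from another terminal on the AWS machine, you can verify that Cloudberry is listening on port 9000; a minimal check, assuming curl is available on the machine:

curl -s -o /dev/null -w '%{http_code}\n' http://localhost:9000
# Prints the HTTP status code; 200 means the Cloudberry web console is up.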

2. Start TwitterMap

2.1 Open a new terminal connecting to your AWS machine

ssh boss@[your aws ip]

2.2 Go to ~/cloudberry/examples/twittermap directory

cd ~/cloudberry/examples/twittermap

2.3 Start Twittermap

sbt "project web" "run 9001"

Wait until the shell prints the following messages:

--- (Running the application, auto-reloading is enabled) ---

[info] p.c.s.NettyServer - Listening for HTTP on /0:0:0:0:0:0:0:0:9001

(Server started, use Ctrl+D to stop and go back to the console...)
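Similarly, you can verify that TwitterMap is listening on port 9001 before opening your browser:

curl -s -o /dev/null -w '%{http_code}\n' http://localhost:9001
# Prints the HTTP status code; 200 means the TwitterMap web server is up.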

3. Visualize tweets on TwitterMap

Open your web browser and go to http://[your aws ip]:9001.

The first time you open the page, it may take up to several minutes to load the following web page:

(Note: Firefox users have to go to about:config and change privacy.trackingprotection.enabled to false and restart Firefox)

(Screenshot: the TwitterMap web page)

Congratulations! You have successfully set up TwitterMap using Cloudberry and AsterixDB!

V) Under-the-hood of Cloudberry

1. Check Schema Metadata in Cloudberry

Access http://[your aws ip]:9000 to check the schema of datasets in Cloudberry.

(Screenshot: the dataset schema shown in the Cloudberry console)

The JSON object below describes the schema used by the TwitterMap application, which is already registered in Cloudberry:

{
  "name":"twitter.ds_tweet",
  "schema":{
    ...
    "dimension":[
      {"name":"create_at","isOptional":false,"datatype":"Time"},
      {"name":"id","isOptional":false,"datatype":"Number"},
      {"name":"coordinate","isOptional":false,"datatype":"Point"},
      {"name":"hashtags","isOptional":true,"datatype":"Bag","innerType":"String"},
      {"name":"geo_tag.stateID","isOptional":false,"datatype":"Number"},
      {"name":"geo_tag.countyID","isOptional":false,"datatype":"Number"},
      {"name":"geo_tag.cityID","isOptional":false,"datatype":"Number"},
      ...
    ],
    "measurement":[
      {"name":"text","isOptional":false,"datatype":"Text"}
    ],
    ...
  },
  ...
}

2. Check views in Cloudberry metadata

Access http://[your aws ip]:19001 to check the metadata table used by Cloudberry inside AsterixDB.

Copy the following query into the Query box and click the Run button.

select name as `viewName`, createQuery.`filter`[0] as `filter` from berry.meta where createQuery is not unknown;

The result should include all the materialized views and their filter conditions, as follows:

{ "viewName": "twitter.ds_tweet_c5285abd05d46a954151ddf9b8128114", "filter": { "field": "text", "relation": "contains", "values": [ "hurricane" ] } }
...

Note: as we send the queries in the following steps, you should see more datasets (ds_tweet_[hash value]) being created; they represent materialized views.
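If you prefer the command line over the web console, the same SQL++ statement can also be sent to AsterixDB's HTTP query service; this sketch assumes the default query-service port 19002 is reachable:

curl -s --data-urlencode 'statement=select name as `viewName`, createQuery.`filter`[0] as `filter` from berry.meta where createQuery is not unknown;' http://[your aws ip]:19002/query/service
# The JSON response contains a "results" array with one entry per materialized view.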

3. Send requests to Cloudberry

You can send requests to Cloudberry through its RESTful API by providing a query as a JSON object.

You can use either the web console or the curl command to send requests to Cloudberry.

Using web console (Recommended)

Access http://[your aws ip]:9000, open the Query tab, copy the JSON request into the Query box, and click the Submit button.

Using curl command

Go to the ~/queries directory.

curl -X POST -H "Content-Type: application/json" -d @[query filename] http://localhost:9000/berry

(Note: press Ctrl+C to stop curl when it keeps waiting for more results.)

3.1. Select Query based on a filter condition [select-id.json]

Return the tweet whose ID equals a specific value.

  • Use Case in Twittermap

Hover over any point on the pinmap.

(Screenshot: hovering over a point on the pinmap)

  • Copy the id value from the hyperlink and paste it into the values entry in the following query.

(Screenshot: the tweet id shown in the hyperlink)

{
  "dataset": "twitter.ds_tweet",
  "filter": [{
    "field": "id",
    "relation": "=",
    "values": "962724417301184512"
  }],
  "select" : {
    "order" : [],
    "limit": 1,
    "offset" : 0,
    "field": ["*"]
  }
}
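To run the same query from the terminal, you can use the curl form shown above with the provided query file:

cd ~/queries
curl -X POST -H "Content-Type: application/json" -d @select-id.json http://localhost:9000/berry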

3.2. Group-By Categorical Attribute [count-per-state.json]

Return the per-state count of tweets that contain the keyword earthquake.

Use Case in Twittermap

Countmap

(Screenshot: the countmap query result)

{
  "dataset": "twitter.ds_tweet",
  "filter": [
    {
      "field": "text",
      "relation": "contains",
      "values": ["earthquake"]
    }
  ],
  "group": {
    "by": [
        {
          "field": "geo_tag.stateID",
          "as": "state"
        }
      ],
    "aggregate": [
      {
        "field": "*",
        "apply": {
          "name": "count"
        },
        "as": "count"
      }
      ]
  }
}

(Notice the query time when there is no view. Now repeat Step 2 to see the newly added view, then repeat Step 3.2 and notice how the query time is reduced by exploiting the view.)
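To observe the speedup from the terminal, you can time the same request twice using the provided query file (time is a standard shell built-in):

cd ~/queries
# First run: if no view for "earthquake" exists yet, Cloudberry scans the base dataset.
time curl -X POST -H "Content-Type: application/json" -d @count-per-state.json http://localhost:9000/berry
# Second run: Cloudberry can answer from the materialized view created after the first run.
time curl -X POST -H "Content-Type: application/json" -d @count-per-state.json http://localhost:9000/berry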

3.3. Group-By Numerical Attribute [count-per-day.json]

Return the per-day count of tweets that contain the keyword earthquake.

Use Case in Twittermap

Timebar

(Screenshot: the timebar query result)

{
  "dataset": "twitter.ds_tweet",
  "filter": [
    {
      "field": "text",
      "relation": "contains",
      "values": ["earthquake"]
    }
  ],
  "group": {
    "by": [
        {
          "field": "create_at",
          "apply": {
            "name": "interval",
            "args": {
              "unit": "day"
            }
          },
          "as": "day"
        }
      ],
    "aggregate": [
      {
        "field": "*",
        "apply": {
          "name": "count"
        },
        "as": "count"
      }
      ]
  }
}
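The same request can be sent with curl; piping the response through Python's built-in JSON formatter makes the per-day counts easier to read (this assumes python is available on the machine):

cd ~/queries
curl -s -X POST -H "Content-Type: application/json" -d @count-per-day.json http://localhost:9000/berry | python -m json.tool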

3.4. Expensive Query without Slicing [expensive-count.json]

  • Return the per-state count of tweets that contain the keyword job.
{
  "dataset": "twitter.ds_tweet",
  "filter": [
    {
      "field": "text",
      "relation": "contains",
      "values": ["job"]
    }
  ],
  "group": {
    "by": [
        {
          "field": "geo_tag.stateID",
          "as": "state"
        }
      ],
    "aggregate": [
      {
        "field": "*",
        "apply": {
          "name": "count"
        },
        "as": "count"
      }
    ]
  }
}

If no materialized view is available for a popular keyword, the first run of the query can take a long time (10~15 seconds). Without query slicing, this delay makes for a poor user experience.
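To see the delay yourself, you can time the expensive query from the terminal; the first run for a new keyword such as job is the slow one:

cd ~/queries
time curl -X POST -H "Content-Type: application/json" -d @expensive-count.json http://localhost:9000/berry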

3.5. Expensive Query with Slicing (DRUM) [drum.json]

DRUM is a technique that returns progressive results in batches.

  • Return the per-state count of tweets that contain the keyword job.
  • Return the results progressively in batches, at an interval controlled by the sliceMillis option (2000 ms in the query below).
{
  "dataset": "twitter.ds_tweet",
  "filter": [
    {
      "field": "text",
      "relation": "contains",
      "values": ["job"]
    }
  ],
  "group": {
    "by": [
        {
          "field": "geo_tag.stateID",
          "as": "state"
        }
      ],
    "aggregate": [
      {
        "field": "*",
        "apply": {
          "name": "count"
        },
        "as": "count"
      }
    ]
  },
  "select" : {
    "order" : ["-count"],
    "limit": 100,
    "offset" : 0
  },
  "option":{
   "sliceMillis": 2000
  }
}

(Notice how the results are returned in batches, and that each batch includes the results of all previous batches. The last batch includes the keyword "DONE" to indicate that the full result has been returned. The pace can be tuned with the sliceMillis option.)
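Running the sliced query with curl makes the batching visible: the connection stays open and a new accumulated result is printed for each batch until the "DONE" batch arrives:

cd ~/queries
curl -X POST -H "Content-Type: application/json" -d @drum.json http://localhost:9000/berry
# Each printed batch accumulates the previous results; the final batch contains "DONE".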

3.6. Switch between Accumulate mode and Delta mode [delta.json]

  • By default, Cloudberry returns accumulated results progressively, i.e., the results returned in each batch are the entire aggregation computed so far.

  • If you want each batch to contain only the newly computed results, add another entry to the option element: "returnDelta": true.

  • Return the per-state count of tweets that contain the keyword job.

  • Return the results progressively in delta mode, at an interval controlled by the sliceMillis option (2000 ms in the query below).

{
  "dataset": "twitter.ds_tweet",
  "filter": [
    {
      "field": "text",
      "relation": "contains",
      "values": ["job"]
    }
  ],
  "group": {
    "by": [
        {
          "field": "geo_tag.stateID",
          "as": "state"
        }
      ],
    "aggregate": [
      {
        "field": "*",
        "apply": {
          "name": "count"
        },
        "as": "count"
      }
    ]
  },
  "select" : {
    "order" : ["-count"],
    "limit": 100,
    "offset" : 0
  },
  "option":{
   "sliceMillis": 2000,
   "returnDelta": true
  }
}
  • Send the query again to see the difference.
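For comparison, you can send both versions back to back from the terminal and watch how the batches differ (accumulated totals vs. per-batch deltas):

cd ~/queries
# Accumulate mode (Section 3.5): each batch contains all results computed so far.
curl -X POST -H "Content-Type: application/json" -d @drum.json http://localhost:9000/berry
# Delta mode: each batch contains only the newly computed results.
curl -X POST -H "Content-Type: application/json" -d @delta.json http://localhost:9000/berry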

You've finished this tutorial! Congratulations!

For further information about how to use Cloudberry and contribute to the project, please refer to the documentation in our GitHub Wiki.