Merge pull request asciidoctor#11 from datastax-training/curriculum-build

Curriculum Build
tlberglund committed Apr 20, 2015
2 parents 9144fce + f2b353b commit 0184c0f
Showing 17 changed files with 416 additions and 319 deletions.
24 changes: 14 additions & 10 deletions cassandra/dev/data-modeling/use-cases/sensor-data/build.gradle
@@ -1,14 +1,14 @@
buildscript {
    repositories {
        mavenLocal()
        mavenCentral()
        jcenter()
    }
    dependencies {
        classpath "com.github.houbie:lesscss-gradle-plugin:1.0.3-less-1.7.0"
        classpath 'com.bluepapa32:gradle-watch-plugin:0.1.5'
        classpath 'org.asciidoctor:asciidoctor-gradle-plugin:1.5.2'
    }
}

apply plugin: 'com.bluepapa32.watch'
@@ -25,3 +25,7 @@ task slides(type: AsciidoctorTask)
task docs(type: AsciidoctorTask)

apply from: "${curriculumRootDir}/gradle/plugins/curriculum.gradle"

docs {
attributes 'image_path': 'images'
}
@@ -1,6 +1,3 @@
= DS220 Apache Cassandra Data Modeling

== Data Modeling Use Case

@@ -17,9 +14,9 @@

. Review the data modeling steps.

image::{image_path}/usecaseimage1.jpg[]

image::{image_path}/investmentreview.svg[]


==== *Instantiate and query the database*
@@ -0,0 +1,8 @@
include::{slide_path}/what-are-sensor-applications.adoc[]
include::{slide_path}/introduction.adoc[]
include::{slide_path}/conceptual-data-model.adoc[]
include::{slide_path}/application-workflow.adoc[]
include::{slide_path}/application-workflow.adoc[]
include::{slide_path}/logical-data-model.adoc[]
include::{slide_path}/analysis.adoc[]
include::{slide_path}/physical-data-model.adoc[]
305 changes: 3 additions & 302 deletions cassandra/dev/data-modeling/use-cases/sensor-data/src/slides.adoc
@@ -8,306 +8,7 @@ DataStax Training
:notes:
:split:

== What are sensor applications?

* Agriculture
* Environment and natural resources
* Healthcare and wellness
* Homeland security
* Military
* Monitoring and control
* Retail
* Robotics and automation
* Smart home/office/auto
* Telematics
* Utilities

== Sensor data: use case introduction

*Data description*

* Multiple sensor networks are deployed over non-overlapping regions
* A sensor network is identified by a unique name
* A sensor belongs to exactly one network
* A sensor has a unique identifier, location, and characteristics (e.g. accuracy, cost, manufacturing date)
* A sensor records new measurements (e.g. temperature, humidity, pressure) every second

*_We will focus on temperature in this example_*

== Sensor data: use case introduction

*Application queries*

* Q~1~: Find information about all networks; order by name (ASC)
* Q~2~: Find hourly average temperatures for all sensors in a specified network for a specified date range; order by date (DESC) and hour (DESC)
* Q~3~: Find information about all sensors in a specified network
* Q~4~: Find raw measurements for a particular sensor on a specified date; order by timestamp (DESC)

[.notes]
--
Q2 is an example of time series data. We order by date and then hour.

Q4 is another example of time series data. Sensors continually collect data and store it with a timestamp.
--

== Sensor data: conceptual data model

*Keys*

* [blue]#*has*#: sensor id
* [blue]#*records*# and [blue]#*Measurement*#: sensor id, timestamp, parameter

image::images/conceptualdatamodel.svg[]

[.notes]
--
Measurement's [emphasis]#*parameter*# attribute indicates [emphasis]#*value*#'s unit (temperature, humidity, etc.). In this example, we always record temperature.

The double-lined diamond indicates an identifying relationship. The double-lined [emphasis]#*Measurement*# indicates a weak entity type. Thus, a [emphasis]#*Measurement*# cannot exist without an identifying [emphasis]#*Sensor*#. If we delete a [emphasis]#*Sensor*#, we must delete all of its associated [emphasis]#*Measurements*#. Without a [emphasis]#*Sensor*#, the [emphasis]#*Measurement*# does not have a [emphasis]#*location*# or [emphasis]#*characteristics*#.

The key of the weak entity type also depends on the key of the strong entity type in the identifying relationship. Thus, [emphasis]#*Sensor*#'s [emphasis]#*id*# forms part of [emphasis]#*Measurement*#'s key, as the slide indicates.

--

== Sensor data: application workflow

image::images/applicationworkflow.svg[]

[.notes]
--
We organize our queries by workflow. The first query retrieves all networks, identified by their names (in this case, we name the networks by number). The second query uses the network name to retrieve the hourly average temperature in a given date range. Using that information, we can generate a heat map for a single point in time. We can also generate a heat-map animation over a time range.

Using the third query, we can produce a geographical image of all of our sensors. The user can then click a sensor for which we can further provide the raw data for a specific day via the fourth query.
--

== Sensor data: logical data model

image::images/logicaldatamodel.svg[]

[.notes]
--
For Q1, the [emphasis]#*Networks*# table stores all of the networks in a single partition. Thus, the partition key is a dummy value. To retrieve this single partition, we write:

****
SELECT * FROM networks;
****

This storage technique does not require a WHERE clause.

The dummy value can be any data. Each row in the CQL result represents a network. The partition is small, as we only have a handful of networks.

For Q2, we retrieve the average hourly temperature for a given date range. We also record the sensor location to later produce the heat map.

For Q3, we retrieve all the sensors within a network. Making [emphasis]#*sensor*# part of the primary key handles the "[emphasis]#*Network*# has [emphasis]#*Sensor*#" relationship.
--

== Sensor data: analysis

*Partition size*

*Networks*
* One small partition

image::images/logical_networks.svg[float="left"]

<<<

--
*Sensor_by_network*

* Assume at most 1,000 sensors per network
* Manageable partitions

image::images/logical_sensors_by_network.svg[float="left"]
--

<<<

--
*Temperatures_by_sensor*

* 86,400 seconds per day
* Manageable partitions

image::images/logical_temperatures_by_sensor.svg[float="left"]
--

[.notes]
--
The next step is to analyze partition sizes for manageability.

*_Networks_*

* The single partition is small as we only have a dozen or so networks.

*_Sensor_by_network_*

* With only 1000 sensors per network, we have manageable partitions.
* We can apply the formula we saw earlier to prove this.

*_Temperatures_by_sensor_*

* We sample temperatures each second and partition by one day.
* 86,400 samples make a manageable partition.
--
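The per-partition row counts quoted in these notes are easy to sanity-check. A minimal sketch, assuming the figures stated on the slides (one sample per second, at most 1,000 sensors per network):

```python
# Rough per-partition row counts for the sensor-data tables.
# Both constants are assumptions taken from the slide text.
SECONDS_PER_DAY = 24 * 60 * 60    # one sample per second, partitioned by day
SENSORS_PER_NETWORK = 1_000       # stated upper bound per network

# temperatures_by_sensor partitions on (sensor, date):
# one row per second-level sample per day.
rows_temperatures_by_sensor = SECONDS_PER_DAY

# sensors_by_network partitions on network: one row per sensor.
rows_sensors_by_network = SENSORS_PER_NETWORK

print(rows_temperatures_by_sensor)   # 86400
print(rows_sensors_by_network)       # 1000
```

Both counts are well within the comfortable range for a single Cassandra partition, which is what the notes conclude.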

== Sensor data: analysis

*Partition size*

*Temperatures_by_network*

** Assume at most 1,000 sensors per network
** 24 hours per day

image::images/logical_temperatures_by_network.svg[float="left"]

<<<

*_365 days_* (1 year)

**** 365 x 24 x 1000 = 8,760,000
**** Large partition


<<<

*_30 days_* (1 month)

**** 30 x 24 x 1000 = 720,000
**** Somewhat manageable

<<<

*_7 days_* (1 week)

**** 7 x 24 x 1000 = 168,000
**** Manageable partitions

[.notes]
The partition size for [emphasis]#*Temperatures_by_network*# will grow too large. In one partition, we accumulate 1,000 average temperatures every hour. Although we may handle this for the first month, our partition sizes will quickly become unmanageable. We can fix this by dividing our partitions by weeks.
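The bucket-size arithmetic in the bullets above can be sketched in a few lines; the 1,000-sensor figure is the slide's stated assumption:

```python
# Rows accumulated per temperatures_by_network partition for different
# time buckets: one averaged row per sensor per hour.
SENSORS = 1_000      # assumed upper bound per network (from the slide)
HOURS_PER_DAY = 24

rows_per_bucket = {
    label: days * HOURS_PER_DAY * SENSORS
    for label, days in [("year", 365), ("month", 30), ("week", 7)]
}

for label, rows in rows_per_bucket.items():
    print(f"{label:>5}: {rows:,} rows per partition")
```

The weekly bucket keeps partitions at 168,000 rows, which is why the physical model adds a week column to the partition key.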

== Sensor data: analysis
*Duplication*

* How many times is *_region_* stored per network?
* How many times is *_location_* stored per sensor?

image::images/logical_networks.svg[]
image::images/logical_sensors_by_network.svg[]
image::images/logical_temperatures_by_network.svg[]

[.notes]
--
In our analysis, we also want to remove unnecessary duplication.

Although region appears in [emphasis]#*Networks*# and [emphasis]#*Temperatures_by_network*#, duplication is minimal. In [emphasis]#*Networks*#, [emphasis]#*region*# will be a unique value. In [emphasis]#*Temperatures_by_network*#, [emphasis]#*region*# is a static column.

We store each location value once in [emphasis]#*Sensor_by_network*#. However, in [emphasis]#*Temperatures_by_network*#, we store [emphasis]#*location*# several times. Although this appears to be unnecessary duplication, we need [emphasis]#*location*# here to generate our heat map. So [emphasis]#*location*# is duplicate data, but it is not duplicate information. Also, adding [emphasis]#*location*# to [emphasis]#*Temperatures_by_network*# roughly doubles the value size (it is one more column in addition to [emphasis]#*avg_temp*#).
--

== Sensor data: physical data model

image::images/physicaldatamodel.svg[]

[.notes]
--
We added week to our [emphasis]#*Temperatures_by_network*# partition key to make our partition sizes manageable. We can say that the first week in January of 2010 represents week 1 and so forth. Thus the first 52 weeks make up 2010, and the first week of 2011 is week 53. One downside to this approach is that a heat map spanning several weeks requires a query that retrieves multiple partitions.

We also merged the date and hour into [emphasis]#*date_hour*# because the TIMESTAMP data type can store both.
--

== Sensor data: physical data model

****
image::images/physical_networks.svg[float="right"]
CREATE TABLE networks (
  dummy TEXT,
  name TEXT,
  region TEXT,
  description TEXT,
  n_sensors INT,
  PRIMARY KEY (dummy, name)
);
-- Q1
SELECT *
FROM networks;
****

[.notes]
Throughout this and the next three slides, notice that the queries are simple because we designed our tables specifically to support them.

== Sensor data: physical data model

****
image::images/physical_temperatures_by_network.svg[float="right"]
CREATE TABLE temperatures_by_network (
  network TEXT,
  week INT,
  date_hour TIMESTAMP,
  sensor TEXT,
  avg_temp FLOAT,
  location TEXT,
  region TEXT STATIC,
  PRIMARY KEY ((network, week), date_hour, sensor)
)
WITH CLUSTERING ORDER BY (date_hour DESC, sensor ASC);
-- Q2
SELECT * FROM temperatures_by_network
WHERE network = ? AND week = ?
  AND date_hour >= ? AND date_hour <= ?;
****

== Sensor data: physical data model

****
image::images/physical_sensors_by_network.svg[float="right"]
CREATE TABLE sensors_by_network (
  network TEXT,
  sensor TEXT,
  location TEXT,
  characteristics MAP<TEXT,TEXT>,
  PRIMARY KEY (network, sensor)
);
-- Q3
SELECT * FROM sensors_by_network
WHERE network = ?;
****

== Sensor data: physical data model

****
image::images/physical_temperatures_by_sensor.svg[float="right"]
CREATE TABLE temperatures_by_sensor (
  sensor TEXT,
  date TIMESTAMP,
  ts TIMESTAMP,
  temp FLOAT,
  PRIMARY KEY ((sensor, date), ts)
)
WITH CLUSTERING ORDER BY (ts DESC);
-- Q4
SELECT * FROM temperatures_by_sensor
WHERE sensor = ? AND date = ?;
****

== End of presentation
:slide_path: slides
:image_path: images
include::includes.adoc[]
