
Indexes


phantom uses a specific set of traits to enforce more advanced Cassandra limitations and schema rules at compile time. Instead of waiting for Cassandra to tell you you've done bad things, phantom won't let you compile them, saving you a lot of time.

The error messages you would get at runtime are now available at compile time, with full domain awareness. That means phantom "knows" what the rules are in Cassandra, so it will automatically prevent you from doing a lot of "bad" things at compile time, one example being the use of a non-indexed column in a where clause.

Modelling indexes and queries

This is the full list of available options you have with respect to Cassandra features, and this is a guide that shows you how to create every single one of them in phantom.

  • Partition keys
  • Compound keys
  • Composite keys
  • Secondary indexes
  • Indexed collections
  • SASI indexes
  • Materialised views
  • How phantom prevents errors at compile time
  • Tips and tricks

How phantom prevents errors at compile time

import com.websudos.phantom.dsl._

case class Student(
  id: UUID,
  name: String
)

class Students extends CassandraTable[Students, Student] {
  object id extends UUIDColumn(this) with PartitionKey[UUID]
  object name extends StringColumn(this)

  def fromRow(row: Row): Student = Student(id(row), name(row))
}

object Students extends Students with Connector {

  /**
   * The below code will result in a compilation error phantom produces by design.
   * This behaviour is not only correct with respect to CQL but also intended by the implementation.
   *
   * The reason why it won't compile is because the "name" column is not an index in the "Students" table, which means using "name" in a "where" clause is
   * invalid CQL. Phantom prevents you from running most invalid queries by simply giving you a compile time error instead.
   */
  def getByName(name: String): Future[Option[Student]] = {
    // BOOM, this is a problem. "name" is not a primary key and therefore this query is invalid.
    select.where(_.name eqs name).one()
  }
}

The compilation error message for the above looks something like this, and what it's telling us is that the eqs operator for equality is not available on the name column, and that's because there is no index defined on the name column in the schema DSL.

 value eqs is not a member of object x$9.name

The way it works might seem overly mysterious to start with, but the logic is simple. There is no implicit conversion in scope to convert your non-indexed column to a QueryColumn. If you don't have an index, you can't query.

  // now we are using `id` in the where clause, which is a valid index so this will compile
  Students.update.where(_.id eqs someId).onlyIf(_.name is "test")

Partition keys

The partition key is the default partitioning key of the table, telling Cassandra how to divide data into partitions and store them accordingly. You must define at least one partition key for a table. Phantom will gently remind you of this with a fatal error.

If you use a single partition key, the PartitionKey column will always be the first key in the resulting CQL PRIMARY KEY. Phantom distinguishes between the kinds of keys using separate traits for PartitionKey and PrimaryKey, as well as a separate ClusteringOrder trait for when you want to define ordering.

Let's take for example the following CQL table, describing culinary recipes:

CREATE TABLE IF NOT EXISTS somekeyspace.recipes(
  url text,
  description text,
  ingredients list<text>,
  servings int,
  lastcheckedat timestamp,
  props map<text, text>,
  uid uuid,
  PRIMARY KEY (url)
);

To model it in phantom, we need a single partition key defined on the schema, so the entire CQL PRIMARY KEY of the table is composed of a single column. In this example we partition by the url field, just like in the CQL above.

import com.websudos.phantom.dsl._

case class Recipe(
  url: String,
  description: Option[String],
  ingredients: List[String],
  servings: Option[Int],
  lastCheckedAt: DateTime,
  props: Map[String, String],
  uid: UUID
)

class Recipes extends CassandraTable[Recipes, Recipe] {

  // notice we explicitly mix in PartitionKey here.
  object url extends StringColumn(this) with PartitionKey[String]

  object description extends OptionalStringColumn(this)

  object ingredients extends ListColumn[String](this)

  object servings extends OptionalIntColumn(this)

  object lastcheckedat extends DateTimeColumn(this)

  object props extends MapColumn[String, String](this)

  object uid extends UUIDColumn(this)


  override def fromRow(r: Row): Recipe = {
    Recipe(
      url(r),
      description(r),
      ingredients(r),
      servings(r),
      lastcheckedat(r),
      props(r),
      uid(r)
    )
  }
}

Composite keys

Using more than one PartitionKey[T] in your schema definition will produce a composite partition key in Cassandra.

The CQL layout for such a table looks like this:

PRIMARY KEY (
  // first the partition keys, grouped into a composite partition key
  (your_partition_key_1, your_partition_key_2),

  // and then the primary keys (clustering columns)
  primary_key_1, primary_key_2
)
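As an illustration, here is a hedged sketch of a phantom table whose two PartitionKey columns produce a composite partition key of the shape shown above. The table and column names are made up for this example and are not part of the original page.

import com.websudos.phantom.dsl._

case class SensorReading(
  region: String,
  sensorId: UUID,
  readAt: DateTime,
  value: Double
)

// Hypothetical table: `region` and `sensorId` together form a composite partition key,
// while `readAt` is an additional primary (clustering) key.
class SensorReadings extends CassandraTable[SensorReadings, SensorReading] {
  object region extends StringColumn(this) with PartitionKey[String]
  object sensorId extends UUIDColumn(this) with PartitionKey[UUID]
  object readAt extends DateTimeColumn(this) with PrimaryKey[DateTime]
  object value extends DoubleColumn(this)

  def fromRow(row: Row): SensorReading =
    SensorReading(region(row), sensorId(row), readAt(row), value(row))
}

// The generated schema ends up with: PRIMARY KEY ((region, sensorId), readAt)

Queries against such a table must restrict both region and sensorId, since together they form the partition key.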

Compound keys

As its name says, the PrimaryKey trait marks a column as a primary key. Using multiple PrimaryKey columns will result in a compound key. In CQL, the first key in a PRIMARY KEY is the one used to partition data; phantom makes this explicit by forcing you to always define a PartitionKey, so you never lose track of how your data is partitioned.

In essence, a compound key means your table has exactly one partition key plus at least one other primary key column. This mix is what CQL calls a compound primary key.

A compound key in C* looks like this.

PRIMARY KEY (
  partition_key,
  primary_key_1,
  primary_key_2
)

Before you add too many of these, remember they all have to go into a `where` clause. You can only query with the full partition key, optionally followed by the primary (clustering) keys in the order they were defined. Phantom can't yet give you a compile time error for this, but Cassandra will give you a runtime one.

Because of how Cassandra works, you will only be able to use the following shapes of `where` clause against the above table:

SELECT WHERE partition_key = "some value"
SELECT WHERE partition_key = "some value" AND primary_key_1 = "some_other_1"
SELECT WHERE partition_key = "some value" AND primary_key_1 = "some_other_1" AND primary_key_2 = "some_other_2"

If you want any other kinds of where clause matches, you will need alternative modelling approaches to obtain them. You can of course mix any kind of valid where operator above, so you are not limited to eqs.
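To make the allowed query shapes concrete, here is a hedged sketch against a hypothetical compound-key table (one partition key, two additional primary keys); the table, column names and the Connector mix-in are assumptions for this example, following the Students pattern from the top of the page.

import scala.concurrent.Future
import com.websudos.phantom.dsl._

case class Entry(partitionId: String, primary1: String, primary2: String, payload: String)

// Hypothetical table with PRIMARY KEY (partitionId, primary1, primary2).
class Entries extends CassandraTable[Entries, Entry] {
  object partitionId extends StringColumn(this) with PartitionKey[String]
  object primary1 extends StringColumn(this) with PrimaryKey[String]
  object primary2 extends StringColumn(this) with PrimaryKey[String]
  object payload extends StringColumn(this)

  def fromRow(row: Row): Entry =
    Entry(partitionId(row), primary1(row), primary2(row), payload(row))
}

object Entries extends Entries with Connector {

  // Valid: the full partition key on its own.
  def byPartition(p: String): Future[List[Entry]] =
    select.where(_.partitionId eqs p).fetch()

  // Valid: the partition key plus the primary keys, in declaration order.
  def byFullKey(p: String, k1: String, k2: String): Future[Option[Entry]] =
    select.where(_.partitionId eqs p).and(_.primary1 eqs k1).and(_.primary2 eqs k2).one()

  // Invalid: restricting only `payload` would not even compile, since it carries no index.
  // def byPayload(v: String) = select.where(_.payload eqs v).fetch()
}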

Secondary indexes

A secondary index in Cassandra lets you query on a column that is not part of the primary key. It is not a high performance construct, and it's generally best avoided; we implemented it for the sake of feature completeness.

When you mix in Index[T] on a column, phantom will let you use it in a where clause.

When you want to use a column in a where clause, you need an index on it. Cassandra data modelling is a more convoluted topic, but phantom offers com.websudos.phantom.keys.Index to enable querying.

The CQL 3 schema for secondary indexes can also be auto-generated with ExampleRecord4.create(), and it is directly taken care of by table auto-generation. Phantom analyses your schema DSL and creates the indexes at the correct point, namely only after the tables themselves have been created; otherwise index creation would fail, since it would refer to a non-existent table.

SELECT is the only query you can perform on an Index column. This is a Cassandra limitation. The relevant tests can be found in the phantom test suite.

import com.websudos.phantom.dsl._

case class ExampleModel(
  id: UUID,
  name: String,
  props: Map[String, String],
  timestamp: DateTime,
  test: Option[Int]
)

sealed class ExampleRecord4 extends CassandraTable[ExampleRecord4, ExampleModel] {

  object id extends UUIDColumn(this) with PartitionKey[UUID]
  object order_id extends LongColumn(this) with ClusteringOrder[Long] with Descending
  object timestamp extends DateTimeColumn(this) with Index[DateTime]
  object name extends StringColumn(this) with Index[String]
  object props extends MapColumn[ExampleRecord4, ExampleModel, String, String](this)
  object test extends OptionalIntColumn(this)

  override def fromRow(row: Row): ExampleModel = {
    ExampleModel(id(row), name(row), props(row), timestamp(row), test(row))
  }
}
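As a hedged sketch, the indexed columns can then be used in where clauses. The companion object below is an assumption for this example: it mixes in the same Connector trait used earlier, and has to live in the same file since the table class is sealed.

import scala.concurrent.Future
import com.websudos.phantom.dsl._

object ExampleRecord4 extends ExampleRecord4 with Connector {

  // Compiles because `name` mixes in Index[String].
  def getByName(name: String): Future[List[ExampleModel]] =
    select.where(_.name eqs name).fetch()

  // Compiles because `timestamp` mixes in Index[DateTime].
  def getByTimestamp(when: DateTime): Future[List[ExampleModel]] =
    select.where(_.timestamp eqs when).fetch()

  // This would NOT compile: `test` carries no index and is not part of the primary key.
  // def getByTest(value: Int) = select.where(_.test eqs value).fetch()
}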

ClusteringOrder can be used with either java.util.Date or org.joda.time.DateTime, or anything else that has an ordering, really. It tells Cassandra to store records in a certain order based on this field. Bear in mind that only the partition is ordered, not the whole table.

What this means is that the PartitionKey part of your primary key is not used to determine ordering, but used to determine the partition. In practice, let's assume you have a table of events for every sensor emitting data, something like a solar panel system, and you want to store data for every panel.

The schema at a very simplistic level might look something like this.

CREATE TABLE events_by_sensor(
  sensor_id uuid,
  event_id timeuuid,
  event_name text,
  event_category text,
  solar_capacity bigint,
  PRIMARY KEY (sensor_id, event_id)
);

So we are using the sensor_id to be able to fetch all events for a given sensor_id with SELECT WHERE sensor_id = ?, and based on that we can retrieve specific sensor records. Let's assume we want to make sure only the most recent events are retrieved. Cassandra gives us CLUSTERING ORDER for that very reason.

CREATE TABLE events_by_sensor(
  sensor_id uuid,
  event_id timeuuid,
  event_name text,
  event_category text,
  solar_capacity bigint,
  PRIMARY KEY (sensor_id, event_id)
) WITH CLUSTERING ORDER BY (event_id DESC);

But this ordering is specific to a particular sensor_id. What this means is that the following query will give you the 10 most recent events. We told Cassandra to cluster records inside the sensor partition by their event_id, and because the event_id is a timeuuid, records are compared by the timestamp embedded in the timeuuid. This happens on every insert: Cassandra figures out where in the sensor_id partition to place the new event so that the ordering requirement is satisfied.

SELECT * FROM keyspace.events_by_sensor WHERE sensor_id = ? LIMIT 10

However, if you were to simply retrieve 10 records at random from the table, you have no implicit ordering. So the ordering you define on the clustering columns will not influence the ordering of the partition keys at all, which makes perfect sense if you think about how Cassandra nests data, but may confuse you at first.

SELECT * FROM keyspace.events_by_sensor LIMIT 10

An example might be: object timestamp extends DateTimeColumn(this) with ClusteringOrder[DateTime] with Ascending.

To fully define a clustering column, you MUST also mix in either Ascending or Descending to indicate the sort order.
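Putting it together, here is a hedged phantom sketch of the events_by_sensor table above; the record class, the Connector mix-in and the companion object are assumptions made for this example.

import scala.concurrent.Future
import com.websudos.phantom.dsl._

case class SensorEvent(
  sensorId: UUID,
  eventId: UUID,
  eventName: String,
  eventCategory: String,
  solarCapacity: Long
)

class EventsBySensor extends CassandraTable[EventsBySensor, SensorEvent] {
  object sensor_id extends UUIDColumn(this) with PartitionKey[UUID]
  // A timeuuid clustering column, stored newest-first within each sensor partition.
  object event_id extends TimeUUIDColumn(this) with ClusteringOrder[UUID] with Descending
  object event_name extends StringColumn(this)
  object event_category extends StringColumn(this)
  object solar_capacity extends LongColumn(this)

  def fromRow(row: Row): SensorEvent =
    SensorEvent(sensor_id(row), event_id(row), event_name(row), event_category(row), solar_capacity(row))
}

object EventsBySensor extends EventsBySensor with Connector {

  // The 10 most recent events for one sensor, relying on CLUSTERING ORDER BY (event_id DESC).
  def latestEvents(sensor: UUID): Future[List[SensorEvent]] =
    select.where(_.sensor_id eqs sensor).limit(10).fetch()
}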


Phantom also supports compound keys out of the box, and the schema can once again be auto-generated. One important thing to note: if you want to use a compound key and define a ClusteringOrder, you will need to define ClusteringOrder on all the columns that would normally be a PrimaryKey. The reason is that Cassandra needs to know how to order the records inside the partition, and it cannot order on just a single field.

A table can have multiple PartitionKey columns and several PrimaryKey definitions. Phantom will use these keys to build a compound key. Example scenario, with the compound key (id, order_id, timestamp, name):

import com.websudos.phantom.dsl._

sealed class ExampleRecord3 extends CassandraTable[ExampleRecord3, ExampleModel] {

  object id extends UUIDColumn(this) with PartitionKey[UUID]
  object order_id extends LongColumn(this) with ClusteringOrder[Long] with Descending
  object timestamp extends DateTimeColumn(this) with ClusteringOrder[DateTime] with Descending
  object name extends StringColumn(this) with ClusteringOrder[String] with Ascending
  object props extends MapColumn[ExampleRecord3, ExampleModel, String, String](this)
  object test extends OptionalIntColumn(this)

  override def fromRow(row: Row): ExampleModel = {
    ExampleModel(id(row), name(row), props(row), timestamp(row), test(row))
  }
}


Phantom also lets you query by Cassandra partition token, which is useful for skipping or paging through records across partitions:

import scala.concurrent.Await
import scala.concurrent.duration._
import com.websudos.phantom.dsl._

sealed class ExampleRecord2 extends CassandraTable[ExampleRecord2, ExampleModel] {

  object id extends UUIDColumn(this) with PartitionKey[UUID]
  object order_id extends LongColumn(this) with ClusteringOrder[Long] with Descending
  object timestamp extends DateTimeColumn(this)
  object name extends StringColumn(this)
  object props extends MapColumn[ExampleRecord2, ExampleModel, String, String](this)
  object test extends OptionalIntColumn(this)

  override def fromRow(row: Row): ExampleModel = {
    ExampleModel(id(row), name(row), props(row), timestamp(row), test(row))
  }
}


// `one` here is a previously fetched record; the query skips ahead past its partition token.
val orderedResult = Await.result(Articles.select.where(_.id gtToken one.get.id).fetch, 5000.millis)


| Operator name | Description |
| ------------- | ----------- |
| eqsToken | The "equals" operator. Matches if the two partition tokens are equal. |
| gtToken | The "greater than" operator. Matches if the record's token is greater than the argument's token. |
| gteToken | The "greater than or equals" operator. Matches if the record's token is greater than or equal to the argument's token. |
| ltToken | The "lower than" operator. Matches if the record's token is less than the argument's token. |
| lteToken | The "lower than or equals" operator. Matches if the record's token is less than or equal to the argument's token. |
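For example, a hedged paging sketch against the Students table defined at the top of this page might use gtToken to fetch the next page of results after the last id already seen; the StudentPaging object and its parameters are hypothetical names introduced for this illustration.

import scala.concurrent.Future
import com.websudos.phantom.dsl._

object StudentPaging {

  // Fetch the next `pageSize` students whose partition token is greater than
  // the token of the last id already processed.
  def nextPage(lastSeenId: UUID, pageSize: Int): Future[List[Student]] =
    Students.select
      .where(_.id gtToken lastSeenId)
      .limit(pageSize)
      .fetch()
}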

For more details on how to use Cassandra partition tokens, see SkipRecordsByToken.scala