In the following, we assume that the reader is using an OS environment where Python 3 and its corresponding pip
helper tool are working correctly and without throwing errors.
This is rarely the case when using the default Python distribution that comes with modern OSes, the main reason being the confusion that is sometimes caused by having both Python 2 and 3 installed, using pip
and pip3
interchangeably, missing OS-level dependencies, inconsistencies in Homebrew setups etc.
One solution to this problem is to use Anaconda, a Python distribution that was created primarily to help data scientists build their solutions on a standard Python environment.
Anaconda installs well-tested versions of a curated set of Python packages that are known to work well together, including Pandas, Numpy, Jupyter Notebook, SciKit Learn, Flask and many others.
Anaconda installers can be found here.
At the time of this writing (July 2018), the most recent distribution of Anaconda3
for Python 3 / MacOS is:
https://repo.continuum.io/archive/Anaconda3-5.2.0-MacOSX-x86_64.pkg
After installing Anaconda3
, open a new terminal window and confirm locations and versions with:
which python
python -V
which pip
pip -V
After this, upgrade pip and install the Python packages required by the MorphL project:
pip install --upgrade pip
pip install google-auth google-api-python-client tensorflow cassandra-driver
Side Note: The MorphL server environment runs exclusively on Anaconda3
.
At this point you should have already received these environment variables from one of your coworkers:
export MORPHL_SERVER_IP_ADDRESS=???.???.???.???
export MORPHL_CASSANDRA_PASSWORD=?????
Place these commands in your ~/.bash_profile
and open a new terminal window.
Confirm that the environment variables are now set:
env | grep MORPHL
This setup will mirror on your laptop the server-side environment where the Python programs will run.
Cassandra features a command-line tool called cqlsh
that is useful for viewing and operating on data in Cassandra.
The following is the recommended way to install it:
mkdir ~/opt
curl -so ~/opt/cassandra.tgz http://www-eu.apache.org/dist/cassandra/3.11.3/apache-cassandra-3.11.3-bin.tar.gz
tar -xf ~/opt/cassandra.tgz -C ~/opt
mv ~/opt/apache-cassandra-* ~/opt/cassandra
rm ~/opt/cassandra.tgz
You can now work with Cassandra directly by running the following command:
/usr/bin/python ~/opt/cassandra/bin/cqlsh.py ${MORPHL_SERVER_IP_ADDRESS} -u morphl -p ${MORPHL_CASSANDRA_PASSWORD}
The output you see should be similar to:
Connected to MorphLCluster at ???.???.???.???:9042.
[cqlsh 5.0.1 | Cassandra 3.11.3 | CQL spec 3.4.4 | Native protocol v4]
Use HELP for help.
morphl@cqlsh>
Note: cqlsh
needs Python 2, hence why we have to use /usr/bin/python
.
Cassandra tables exist within keyspaces (keyspace
is the term Cassandra uses to designate a namespace).
To switch to the already defined morphl
keyspace, run the following command:
USE morphl;
The cqlsh
prompt should change to reflect the name of the keyspace:
morphl@cqlsh> USE morphl;
morphl@cqlsh:morphl>
List the tables in this keyspace:
DESC TABLES;
The output:
morphl@cqlsh:morphl> DESC TABLES;
ps_area
morphl@cqlsh:morphl>
View the definition of the ps_area
table:
DESC ps_area;
The output:
morphl@cqlsh:morphl> DESC ps_area;
CREATE TABLE morphl.ps_area (
client_id text,
day_of_data_capture date,
days_since_last_seen int,
PRIMARY KEY (client_id, day_of_data_capture)
) WITH CLUSTERING ORDER BY (day_of_data_capture DESC)
Note: The above is not the complete output. We choose to ignore the rest of the output because it doesn't pertain to our data modeling concerns.
Note: The way we initially created table morphl.ps_area
is this:
CREATE TABLE morphl.ps_area (
client_id text,
day_of_data_capture date,
days_since_last_seen int,
PRIMARY KEY ((client_id), day_of_data_capture)
) WITH CLUSTERING ORDER BY (day_of_data_capture DESC);
There is one subtle difference, and we are going to cover it later.
cqlsh
provides the flag -f
which can be used like this:
/usr/bin/python ~/opt/cassandra/bin/cqlsh.py ${MORPHL_SERVER_IP_ADDRESS} -u morphl -p ${MORPHL_CASSANDRA_PASSWORD} -f /tmp/abc.cql
where /tmp/abc.cql
is a file that contains CQL statements.
From a storage point of view, a Cassandra table is spread across multiple partitions.
In our example, all data for one particular client_id
is stored in its own partition, physically separate from all other partitions.
This is reflected in how we designed the primary key:
PRIMARY KEY ((client_id), day_of_data_capture)
Here the client_id
is the partitioning key, and this was made explicit by surrounding the column name with an extra set of parentheses: (client_id)
.
The second part of the primary key is the clustering key, in our case it's the column day_of_data_capture
.
The clustering key is used to sort the data within a partition, in our case in descending order, as reflected by the clause:
WITH CLUSTERING ORDER BY (day_of_data_capture DESC)
As a consequence of Cassandra's design, we are in a position to run, for certain use cases, SELECT
queries that look slightly different from what you are used to:
SELECT * FROM ps_area PER PARTITION LIMIT 1;
The output:
morphl@cqlsh:morphl> SELECT * FROM ps_area PER PARTITION LIMIT 1;
client_id | day_of_data_capture | days_since_last_seen
-----------+---------------------+----------------------
GA1 | 2018-07-05 | 4
GA2 | 2018-07-10 | 6
(2 rows)
This query will extract the first record from each client_id
partition, and because data within each partition is sorted by day_of_data_capture
in descending order, saying "the first record" is equivalent to saying "the most recently captured record".
Obviously, we can ask for the entire contents of the table:
SELECT * FROM ps_area;
The output:
morphl@cqlsh:morphl> SELECT * FROM ps_area;
client_id | day_of_data_capture | days_since_last_seen
-----------+---------------------+----------------------
GA1 | 2018-07-05 | 4
GA1 | 2018-07-01 | 15
GA2 | 2018-07-10 | 6
GA2 | 2018-07-03 | 25
(4 rows)
Inserts also behave slightly differently in Cassandra, as exemplified by these operations:
INSERT INTO ps_area (client_id,day_of_data_capture)
VALUES ('GA5','2018-06-25');
SELECT * FROM ps_area;
INSERT INTO ps_area (client_id,day_of_data_capture,days_since_last_seen)
VALUES ('GA5','2018-06-25',43);
SELECT * FROM ps_area;
The output:
morphl@cqlsh:morphl> INSERT INTO ps_area (client_id,day_of_data_capture)
... VALUES ('GA5','2018-06-25');
morphl@cqlsh:morphl> SELECT * FROM ps_area;
client_id | day_of_data_capture | days_since_last_seen
-----------+---------------------+----------------------
GA5 | 2018-06-25 | null
GA1 | 2018-07-05 | 4
GA1 | 2018-07-01 | 15
GA2 | 2018-07-10 | 6
GA2 | 2018-07-03 | 25
(5 rows)
morphl@cqlsh:morphl> INSERT INTO ps_area (client_id,day_of_data_capture,days_since_last_seen)
... VALUES ('GA5','2018-06-25',43);
morphl@cqlsh:morphl> SELECT * FROM ps_area;
client_id | day_of_data_capture | days_since_last_seen
-----------+---------------------+----------------------
GA5 | 2018-06-25 | 43
GA1 | 2018-07-05 | 4
GA1 | 2018-07-01 | 15
GA2 | 2018-07-10 | 6
GA2 | 2018-07-03 | 25
(5 rows)
A relational database would have accepted the first INSERT
statement, but would have rejected the second one, claiming a violation of the primary key uniqueness constraint.
Cassandra will not complain about the second INSERT
statement, and will change days_since_last_seen
from null
(non-existent) to 43
.
As a consequence of this design, in Cassandra all INSERT
statements are in fact UPSERT
statements (INSERT
if absent, UPDATE
if present).
Although Cassandra does in principle allow syntax for UPDATE
statements, there is a generally accepted idiomatic practice to only use INSERT
statements.
Assumptions:
a) We have already installed the Cassandra driver for Python:
pip install cassandra-driver
b) We have the necessary environment variables, please make sure that they are now set:
env | grep MORPHL
When inserting into Cassandra from Python, it is good practice to "prepare once, bind in a loop".
This is for both performance and security reasons.
Preparing means parsing the syntax and compiling the INSERT
statement into an intermediate form.
Preparing is a computationally expensive operation that is typically performed once, before entering a loop.
Binding turns the intermediate form generated above into an executable form by supplying actual values to insert.
Binding is a computationally cheap operation that is typically performed inside of a loop.
The following is setup code that is needed before we can execute any operations in Cassandra.
Notice how the Python program takes the IP address of the Cassandra server from the OS environment, and also needs a user/password combination, as well as the name of the Cassandra keyspace to use):
from os import getenv
from cassandra.cluster import Cluster
from cassandra.auth import PlainTextAuthProvider
MORPHL_SERVER_IP_ADDRESS = getenv('MORPHL_SERVER_IP_ADDRESS')
MORPHL_CASSANDRA_USERNAME = 'morphl'
MORPHL_CASSANDRA_PASSWORD = getenv('MORPHL_CASSANDRA_PASSWORD')
MORPHL_CASSANDRA_KEYSPACE = 'morphl'
auth_provider = PlainTextAuthProvider(
username=MORPHL_CASSANDRA_USERNAME, password=MORPHL_CASSANDRA_PASSWORD)
cluster = Cluster(
contact_points=[MORPHL_SERVER_IP_ADDRESS], auth_provider=auth_provider)
session = cluster.connect(MORPHL_CASSANDRA_KEYSPACE)
Now that we have a connected session
object, we prepare a statement, enter a loop, then repeatedly bind some values and execute the insert:
insert_statement = 'INSERT INTO ps_area (client_id,day_of_data_capture,days_since_last_seen) VALUES (?,?,?)'
prep_stmt_for_ps_area_insert = session.prepare(insert_statement)
# Begin Loop
bind_list_for_ps_area_insert = ['GA8','2018-07-15',28]
session.execute(prep_stmt_for_ps_area_insert, bind_list_for_ps_area_insert)
# End Loop
If we now go back and look at the data, this is what we see:
morphl@cqlsh:morphl> SELECT * FROM ps_area;
client_id | day_of_data_capture | days_since_last_seen
-----------+---------------------+----------------------
GA8 | 2018-07-15 | 28
GA5 | 2018-06-25 | 43
GA1 | 2018-07-05 | 4
GA1 | 2018-07-01 | 15
GA2 | 2018-07-10 | 6
GA2 | 2018-07-03 | 25
(6 rows)