This is a Spark/Cassandra demo using the open-source Spark Cassandra Connector

There are 5 packages, each containing a distinct demo

  • us.unemployment.demo
  • Ingestion
    1. FromCSVToCassandra: read US employment data from a CSV file into Cassandra
    2. FromCSVCaseClassToCassandra: read US employment data from a CSV file, create case classes and insert them into Cassandra (see the sketch after this list)
  • Read
    1. FromCassandraToRow: read US employment data from Cassandra into the low-level CassandraRow object
    2. FromCassandraToCaseClass: read US employment data from Cassandra into a custom Scala case class, leveraging the built-in object mapper
    3. FromCassandraToSQL: read US employment data from Cassandra using the connector's SparkSQL integration
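
    A minimal sketch of the write and read paths above, using the connector's RDD API. The `Employment` case class, the CSV layout, and the `spark_demo.us_employment` keyspace/table names are assumptions for illustration; the real demo defines its own schema:

```scala
import com.datastax.spark.connector._
import org.apache.spark.{SparkConf, SparkContext}

// Hypothetical case class; fields are mapped to Cassandra columns by name.
case class Employment(year: Int, month: Int, rate: Double)

object EmploymentSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("us-employment-sketch")
      .set("spark.cassandra.connection.host", "127.0.0.1")
    val sc = new SparkContext(conf)

    // Ingestion (FromCSVCaseClassToCassandra style): parse the CSV into
    // case classes and let the connector write them to Cassandra.
    sc.textFile("/path/to/us_employment.csv")
      .map(_.split(","))
      .map(cols => Employment(cols(0).toInt, cols(1).toInt, cols(2).toDouble))
      .saveToCassandra("spark_demo", "us_employment")

    // Read back as low-level CassandraRow (FromCassandraToRow style) ...
    val rows = sc.cassandraTable("spark_demo", "us_employment")
    println(rows.first.getDouble("rate"))

    // ... or directly into the case class via the built-in object mapper
    // (FromCassandraToCaseClass style).
    sc.cassandraTable[Employment]("spark_demo", "us_employment")
      .take(5).foreach(println)

    sc.stop()
  }
}
```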
  • twitter.stream
  • TwitterStreaming: demo of a Twitter stream saved back to Cassandra (stream IN). To make this demo work, you need to start the job with the following JVM options (a minimal sketch follows):
    1. -Dtwitter4j.oauth.consumerKey="value"
    2. -Dtwitter4j.oauth.consumerSecret="value"
    3. -Dtwitter4j.oauth.accessToken="value"
    4. -Dtwitter4j.oauth.accessTokenSecret="value"

    If you don't have Twitter app credentials, create a new app at https://apps.twitter.com/
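
    Roughly, the streaming demo boils down to the sketch below. The `Tweet` case class and the `spark_demo.tweets` table name are assumptions for illustration; twitter4j picks up the four OAuth system properties above automatically:

```scala
import com.datastax.spark.connector.streaming._
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.twitter.TwitterUtils

// Hypothetical table layout; the real demo defines its own schema.
case class Tweet(id: Long, user: String, text: String)

object TwitterStreamingSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("twitter-stream-sketch")
      .set("spark.cassandra.connection.host", "127.0.0.1")
    val ssc = new StreamingContext(conf, Seconds(5))

    // Passing None makes twitter4j build its OAuth credentials from the
    // -Dtwitter4j.oauth.* system properties listed above.
    TwitterUtils.createStream(ssc, None)
      .map(s => Tweet(s.getId, s.getUser.getScreenName, s.getText))
      .saveToCassandra("spark_demo", "tweets") // table name is an assumption

    ssc.start()
    ssc.awaitTermination()
  }
}
```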
    
  • analytics.music
  • Data preparation
    1. Go to the folder main/data
    2. Execute `$CASSANDRA_HOME/bin/cqlsh -f music.cql` from this folder. It should create the keyspace `spark_demo` and some tables
    3. The script will then load the content of performers.csv and albums.csv into Cassandra
  • Scenarios

    All examples extend the `BaseExample` class, which configures a SparkContext and truncates some tables automatically for you, so that each example can be executed several times with consistent results
    1. Example1 : in this example, we read data from the `performers` table to extract performers and styles into the `performers_by_style` table (an Example1-style sketch follows this list)
    2. Example2 : in this example, we read data from the `performers` table and group styles by performer for aggregation. The results are saved back into the `performers_distribution_by_style` table
    3. Example3 : similar to Example2, but we only extract the top 10 styles for artists and groups and save the results into the `top10_styles` table
    4. Example4 : in this example, we want to know, for each decade, the number of albums released by each artist, grouped by their country of origin. For this we join the `performers` table with `albums`. The results are saved back into the `albums_by_decade_and_country` table
    5. Example5 : similar to Example4, but we perform the join using SparkSQL. We also filter out countries with low release counts. The results are saved back into the `albums_by_decade_and_country_sql` table
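
    As a reference point, an Example1-style job can be sketched as follows; the `styles` and `name` column names are assumptions for illustration, not necessarily the demo's actual schema:

```scala
import com.datastax.spark.connector._
import org.apache.spark.{SparkConf, SparkContext}

object Example1Sketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("example1-sketch")
      .set("spark.cassandra.connection.host", "127.0.0.1")
    val sc = new SparkContext(conf)

    // Read each performer with its list of styles, emit one (style, name)
    // pair per style, and write the pairs into the target table.
    sc.cassandraTable("spark_demo", "performers")
      .flatMap(row => row.getList[String]("styles")
        .map(style => (style, row.getString("name"))))
      .saveToCassandra("spark_demo", "performers_by_style",
        SomeColumns("style", "performer"))

    sc.stop()
  }
}
```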
  • usecases

    These scenarios exemplify how Spark can be used to achieve various real-world use cases
  • Scenarios
    1. CrossClusterDataMigration : sample code showing how to perform effective cross-cluster operations. DO NOT EXECUTE IT
    2. CrossDCDataMigration : sample code showing how to perform effective cross-data-center operations. DO NOT EXECUTE IT
    3. DataCleaningForPerformers : in this scenario, we read data from the `performers` table to clean up empty _country_ fields and reformat the _born_ and _died_ dates, if present. The data are saved back into Cassandra, thus achieving perfect data locality (a minimal sketch follows this list)
    4. DisplayPerformersData : a utility class to show data before and after the cleaning
    5. MigrateAlbumnsData : in this scenario, we read source data from `albums` and save them into a new table `albums_by_country`, purposely built for fast queries on country and year
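
    A minimal sketch of the DataCleaningForPerformers idea, assuming a hypothetical `Performer` case class with text columns and a `dd/MM/yyyy` input date format (the real class and cleaning rules may differ). Because the cleaned rows are written back to the table they came from, each Spark partition talks to the Cassandra replicas owning its token range, which is the data-locality point made above:

```scala
import com.datastax.spark.connector._
import org.apache.spark.{SparkConf, SparkContext}

// Hypothetical case class; the real demo defines its own schema.
case class Performer(name: String, country: String, born: String, died: String)

object DataCleaningSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf()
      .setAppName("data-cleaning-sketch")
      .set("spark.cassandra.connection.host", "127.0.0.1"))

    // Normalize an assumed "dd/MM/yyyy" input into ISO "yyyy-MM-dd".
    def fixDate(d: String): String =
      if (d == null || d.isEmpty) d
      else d.split("/").reverse.mkString("-")

    sc.cassandraTable[Performer]("spark_demo", "performers")
      .map(p => p.copy(
        // Replacing empty countries with a placeholder is an assumption.
        country = if (p.country == null || p.country.isEmpty) "Unknown" else p.country,
        born = fixDate(p.born),
        died = fixDate(p.died)))
      .saveToCassandra("spark_demo", "performers")

    sc.stop()
  }
}
```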
  • weather.data.demo
  • Data preparation
    1. Go to the folder main/data
    2. Execute `$CASSANDRA_HOME/bin/cqlsh -f weather_data_schema.cql` from this folder. It should create the keyspace `spark_demo` and some tables
    3. Download the Weather_Raw_Data_2014.csv.gz file from here (>200 MB)
    4. Unzip it somewhere on your disk
  • Ingestion
    1. WeatherDataIntoCassandra: read the whole Weather_Raw_Data_2014.csv file (about 30 million lines) and insert the data into Cassandra. The ingestion may take some time, so go grab a long coffee (< 1 hour on my MacBook Pro 15"). Please do not forget to set the path to this file by changing the WeatherDataIntoCassandra.WEATHER_2014_CSV value
    This step should take a while since there are about 30 million lines to insert into Cassandra
  • Read
    1. WeatherDataFromCassandra: read all the raw weather data plus all weather station details, filter the data for French stations, and keep only data between March and June 2014. Then compute averages on temperature and pressure (a minimal sketch follows below)
    This step should take a while since there are about 30 million lines to read from Cassandra
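
    A rough sketch of what WeatherDataFromCassandra computes; all column and table names (`weather_stations`, `raw_weather_data`, `station_id`, `country_code`, `year`, `month`, `temperature`, `pressure`) are assumptions for illustration:

```scala
import com.datastax.spark.connector._
import org.apache.spark.{SparkConf, SparkContext}

object WeatherAveragesSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf()
      .setAppName("weather-averages-sketch")
      .set("spark.cassandra.connection.host", "127.0.0.1"))

    // Collect the ids of French stations; the station table is assumed
    // small enough to hold in the driver as a Set.
    val frenchStations = sc.cassandraTable("spark_demo", "weather_stations")
      .filter(_.getString("country_code") == "FR")
      .map(_.getString("station_id"))
      .collect()
      .toSet

    // Keep only French readings between March and June 2014, then compute
    // the average temperature and pressure per station.
    val averages = sc.cassandraTable("spark_demo", "raw_weather_data")
      .filter(row => frenchStations.contains(row.getString("station_id")))
      .filter(row => row.getInt("year") == 2014 &&
                     row.getInt("month") >= 3 && row.getInt("month") <= 6)
      .map(row => (row.getString("station_id"),
                   (row.getDouble("temperature"), row.getDouble("pressure"), 1L)))
      .reduceByKey { case ((t1, p1, c1), (t2, p2, c2)) =>
        (t1 + t2, p1 + p2, c1 + c2) }
      .mapValues { case (t, p, c) => (t / c, p / c) }

    averages.take(10).foreach(println)
    sc.stop()
  }
}
```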