- Create an Amazon S3 bucket
- Creating Amazon Athena Database and Table
- Signing up for Amazon Quicksight Standard Edition
- Configuring Amazon QuickSight to use Amazon Athena as data source
- Visualizing the data using Amazon QuickSight
Note: If you have already have an S3 bucket in your AWS Account you can skip this section.
- Open the AWS Management console for Amazon S3
- On the S3 Dashboard, Click on Create Bucket.
-
In the Create Bucket pop-up page, input a unique Bucket name. It is advised to choose a large bucket name, with many random characters and numbers (no spaces).
- Select the region as Oregon.
- Click Next to navigate to next tab.
- In the Set properties tab, leave all options as default.
- In the Set permissions tag, leave all options as default.
- In the Review tab, click on Create Bucket
Note: If you have complete the Lab 1: Serverless Analysis of data in Amazon S3 using Amazon Athena you can skip this section and go to the next section Signing up for Amazon Quicksight Standard Edition
Amazon Athena uses Apache Hive to define tables and create databases. Databases are a logical grouping of tables. When you create a database and table in Athena, you are simply describing the schema and location of the table data in Amazon S3. In case of Hive, databases and tables don’t store the data along with the schema definition unlike traditional relational database systems. The data is read from Amazon S3 only when you query the table. The other benefit of using Hive is that the metastore found in Hive can be used in many other big data applications such as Spark, Hadoop, and Presto. With Athena catalog, you can now have Hive-compatible metastore in the cloud without the need for provisioning a Hadoop cluster or RDS instance. For guidance on databases and tables creation refer Apache Hive documentation. The following steps provides guidance specifically for Amazon Athena.
-
In the Create Bucket pop-up page, input a unique Bucket name. It is advised to choose a large bucket name, with many random characters and numbers (no spaces).
- Select the region as Oregon.
- Click Next to navigate to next tab.
- In the Set properties tab, leave all options as default.
- In the Set permissions tag, leave all options as default.
- In the Review tab, click on Create Bucket
- Open the AWS Management Console for Athena.
- If this is your first time visiting the AWS Management Console for Athena, you will get a Getting Started page. Choose Get Started to open the Query Editor. If this isn't your first time, the Athena Query Editor opens.
- Make a note of the AWS region name, for example, for this lab you will need to choose the US West (Oregon) region.
- In the Athena Query Editor, you will see a query pane with an example query. Now you can start entering your query in the query pane.
- To create a database named mydatabase, copy the following statement, and then choose Run Query:
CREATE DATABASE mydatabase
- Ensure mydatabase appears in the DATABASE list on the Catalog dashboard
-
Ensure that current AWS region is US West (Oregon) region
-
Ensure mydatabase is selected from the DATABASE list and then choose New Query.
-
In the query pane, copy the following statement to create a the NYTaxiRides table, and then choose Run Query:
CREATE EXTERNAL TABLE NYTaxiRides (
vendorid STRING,
pickup_datetime TIMESTAMP,
dropoff_datetime TIMESTAMP,
ratecode INT,
passenger_count INT,
trip_distance DOUBLE,
fare_amount DOUBLE,
total_amount DOUBLE,
payment_type INT
)
PARTITIONED BY (YEAR INT, MONTH INT, TYPE string)
STORED AS PARQUET
LOCATION 's3://us-west-2.serverless-analytics/canonical/NY-Pub'
4.Ensure the table you just created appears on the Catalog dashboard for the selected database.
Now that you have created the table you need to add the partition metadata to the Amazon Athena Catalog.
- Choose New Query, copy the following statement into the query pane, and then choose Run Query to add partition metadata.
MSCK REPAIR TABLE NYTaxiRides
The returned result will contain information for the partitions that are added to NYTaxiRides for each taxi type (yellow, green, fhv) for every month for the year from 2009 to 2016
- Open the AWS Management Console for QuickSight.
- If this is the first time you are accessing QuickSight, you will see a sign-up landing page for QuickSight.
- Click on Sign up for QuickSight.
Note: Chrome browser might timeout at this step. If that's the case, try this step in Firefox/Microsoft Edge/Safari.
- On the next page, for the subscription type select the "Standard Edition" and click Continue.
-
On the next page,
i. Enter a unique QuickSight account name.
ii. Enter a valid email for Notification email address.
iii. Just for this step, leave the QuickSight capacity region as N.Virginia.
iv. Ensure that Enable autodiscovery of your data and users in your Amazon Redshift, Amazon RDS and AWS IAM Services and Amazon Athena boxes are checked.
v. Click Finish.
vi. You will be presented with a message Congratulations! You are signed up for Amazon QuickSight! on successful sign up. Click on Go to Amazon QuickSight.
-
Before continuing with the following steps, make sure you are in the N. Virginia Region to edit permissions.
Now, on the Amazon QuickSight dashboard, navigate to User Settings page on the Top-Right section and click Manage QuickSight.
- In this section, click on Security & permissions and then click Add or remove.
- Click on Amazon S3 and on the tab that says S3 buckets linked to QuickSight account.
- Ensure Select All is checked.
- Click on Select buckets.
- Now, select the S3 Buckets you can access across AWS tab on the top right. Make sure Use a different bucket is selected. Insert us-west-2.serverless-analytics as the bucket name and select Add S3 bucket. It should look similar to below:
- When you are done doing all this, click Update to bring you back to the user settings back.
For this lab, you will need to choose the US West (Oregon) region.
-
Click on the region icon on the top-right corner of the page, and select US West (Oregon).
-
Click on Manage data on the top-right corner of the webpage to review existing data sets.
-
Click on New data set on the top-left corner of the webpage and review the options.
-
Select Athena as a Data source.
- Enter the Data source name (e.g. AthenaDataSource).
- Click Create data source.
- Select the mydatabase database.
- Choose the nytaxirides table.
- Choose Edit/Preview data.
This is a crucial step. Please ensure you choose Edit/Preview data.
-
Under Fields on the left column, choose Add calculate field
i. Select the extract operation from Function list.
ii. Select pickup_datetime from the Field list.
iii. For Calculated field name, type hourofday.
iv. Type ‘HH’ so the Formula is extract('HH',{pickup_datetime})
v. Choose Create to add a field which is calculated from an existing field. In this case, the hourofday field is calculated from the pickup_datetime field based on the specified formula.
-
Choose Save and Visualize on top of the page.
Now that you have configured the data source and created a new field to represent the hour of the day, in this section you will filter the data by year followed by month to visualize the taxi data for the entire month of January 2016 based on the pickup_datetime field.
-
Ensure that current AWS region is US West (Oregon) region.
-
Under the Fields List, select the year field to show the distribution of fares per year.
-
To reformat the year without comma
i. Select the dropdown arrow for the year field.
ii. Select Format 1,234.5678 from the dropdown menu.
iii. Select 1235.
-
To add a filter on the year field,
i. Select the dropdown for year field from the Fields list.
ii. Select Add filter to the field from the dropdown menu.
-
To filter the data only for the year 2016
i. Choose the new filter that you just created by clicking on # next to filter name year under the Edit filter menu.
ii. Select Filter list for the two dropdowns under the filter name.
iii. Deselect Select All.
iv. Select only 2016.
v. Click Apply.
vi. Click Close.
-
Ensure that current AWS region is US West(Oregon) region.
-
Select Visualize from the navigation menu in the left-hand corner.
-
Under the Fields list, deselect year by clicking on year field name.
-
Select month by clicking on the month field name from the Fields list.
-
To filter the data set for the month of January (Month 1)
i. Select the dropdown arrow for month field under the Fields List.
ii. Select Add filter to the field.
-
To filter the data for month of January 2016 (Month 1),
i. Choose the new filter that you just created by clicking on # next to filter name month under the Edit Filter menu.
ii. Select Filter list for the two dropdowns under the filter name.
iii. Deselect ALL.
iv. Select only 1.
v. Click Apply
vi. Click Close.
- Select Visualize from the navigation menu in the left-hand corner.
- Under the Fields list, deselect month by clicking on month field name.
- Select hourofday by clicking on the hourofday field name from the Fields list.
- Change the visual type to a line chart by selecting the line chart icon highlighted in the screenshot below under Visual types.
- Using the slider on x-axis, select the entire range [0,23] for hourofday field.
-
Click on the double drop-down arrow underneath your username at the top-right corner of the page to reveal X-axis, Value and Color under Field wells.
-
Under the Fields list, deselect hourofday by clicking on hourofday field name.
-
Select pickup_datetime for x-axis by clicking on the pickup_datetime field name from Fields list.
-
Select type for Color by clicking on the type field name from Fields list.
-
Click on the field name pickup_datetime in x-axis to reveal a sub-menu.
-
Select Aggregate:Day to aggregate by day.
- Using the slider on x-axis, select the entire month of January 2016 for pickup_datetime field.
Note: The interesting outlier in the above graph is that on Jan23rd, 2016, you see the dip in the number of taxis across all types. Doing a quick google search for that date, gets us this weather article from NBC New York
Using Amazon QuickSight, you were able to see patterns across a time-series data by building visualizations, performing ad-hoc analysis, and quickly generating insights.
This library is licensed under the Apache 2.0 License.