Skip to content

neighborhoods/honeycomb

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

honeycomb

honeycomb serves as an abstraction around connections to a Hadoop/Hive-based data lake and other databases, built with extensibility to other services in mind. It handles connections, type conversions, a degree of query generation, and more for users, in order to greatly reduce the degree of technical expertise and Python knowledge required to interact with the NHDS data lake and other data repositories. In doing this, we aim to increase data availability throughout the organization, and facilitate any individual member feeling empowered to interact with data.

Usage

Much of the functionality of honeycomb is completely under-the-hood, and as a result the average use case of it is very straightforward.

General

  1. Supported engines are currently:
    • Runs against data lake:
      1. Hive - Runs against data lake. Queries are slightly slower than Presto, but has effectively has no limit on query size and has advanced recovery/reliability functionality. This is the suggested engine for most queries. Must provide a schema in queries.
      2. Presto - Runs against data lake. Presto runs queries more quickly than other engines, but has query size limitations. Returns all columns as strings. Used for quick, ad-hoc queries, but Hive is recommended over Presto in almost all situations. Must provide a schema in queries.
    • Does not run against data lake
      1. Google BigQuery - Runs against data stored in Google BigQuery. Currently, the only data stored there is clickstream data and cost information on our GA/GBQ accounts. Must provide a project ID, dataset, and table name in queries. NOTE: BigQuery uses different SQL syntax than the lake, and this must be reflected in your queries.
      from honeycomb.bigquery import run_gbq_query
      
      query = "SELECT * from `places-clickstream`.9317736.ga_sessions_20200728 LIMIT 5"
      df = run_gbq_query(query, project_id='places-clickstream')
      
      1. Salesforce - Runs against data stored in the Salesforce Object Manager. NOTE: Salesforce uses SOQL, which has different syntax than the lake.
      from honeycomb import salesforce as hcsf
      
      query = "SELECT Id FROM Lead LIMIT 5"
      df = hcsf.run_sf_query(query)
      
      df_star = hcsf.sf_select_star(object_name='Lead',
                                    where_clause='WHERE CreatedDate <= 2020-08-01T00:00-00:00',
                                    limit_clause='LIMIT 5')
      

Running Queries

The run_lake_query function allows for running queries through either the presto or hive querying engines.

import honeycomb as hc

df0 = hc.run_lake_query('SELECT COUNT(*) FROM experimental.test_table',
                        engine='presto')

df1 = hc.run_lake_query('SELECT COUNT(*) FROM experimental.test_table',
                        engine='hive')

Table Creation

honeycomb only supports table creation using hive as the engine. To create a table in the data lake, all that is required is a dataframe, a table name - it defaults to writing to the 'experimental' zone.

create_table_from_df will infer the file type to create the table with based on the extension of the provided filename. If filename is not provided or does not contain an extension, it will default to Parquet.

NOTE: If your dataframe contains a high degree of nesting, generating DDL for its table can be very time-consuming. It may be faster to generate the table with a subset of the dataframe's rows, and append the remaining rows after the fact.

An empty directory is expected for a table's underlying storage. If files are already present in the S3 path designated to store files for the new table, table creation will fail.

Additional parameters:

  1. schema: The schema to write to. Most users will only have access to write to the 'experimental' zone.
  2. dtypes: A dictionary from column names to data types. honeycomb automatically converts Python/pandas data types into Hive-compliant data types, but this parameter allows for manual override of this. For example, pd.DateTimes are normally converted to hive datetimes, but if specified in this parameter, it could be saved as a string instead.
  3. path: Where the files that this table references should be stored in S3. It defaults to the table's name, and under most circumstances should be kept this way.
  4. filename: The name to store the file containing the DataFrame. If writing to the experimental zone, this is optional, and a name based on a timestamp will be generated. However, when writing to any other lake zone, a specified name is required, in order to ensure that the table's underlying filesystem will be navegable in the future.
  5. table_comment: A plaintext description of what data the table contains and what its purpose is. When writing to the experimental zone, this is optional, but with any other zone it is required.
  6. column_comments: A dictionary from column names to a plaintext description of what the column contains. This is optional when writing to the experimental zone, but with any other zone it is required.
  7. timezones: A dictionary from datetime columns to the timezone they represent. If the column is timezone-naive, it will have the timezone added to its metadata, leaving the times themselves unmodified. If the column is timezone-aware and is in a different timezone than the one that is specified, the column's timezone will be converted, modifying the original times.
  8. copy_df: Whether the operations performed on df should be performed on the original or a copy. Keep in mind that if this is set to False, the original df passed in will be modified as well - twice as memory efficient, but may be undesirable if the df is needed again later
  9. partitioned_by: Dictionary or list of tuples containing a partition name and type. Cannot be a vanilla dictionary if using Python version < 3.6
  10. partition_values: Required if 'partitioned_by' is used and 'auto_upload_df' is True. List of tuples containing partition name and value to store the dataframe under
  11. overwrite: In the case of a table already existing with the specified name, states whether to drop a table and completely remove its underlying files or throw errors. In any other case, this parameter does nothing.
  12. auto_upload_df: Whether the df that the table's structure will be based off of should be automatically uploaded to the table, or just used to generate and execute the DDL.
import pandas as pd
import honeycomb as hc

df = pd.DataFrame({'col1': [1, 2, 3], 'col2': [4, 5, 6]})
hc.create_table_from_df(df, table_name='test_table')

Table Appending

honeycomb only supports using hive as the engine for table appending. To append a DataFrame to a table, all that is needed is the table name. Schema can optionally be provided and defaults to 'experimental'.

import pandas as pd
import honeycomb as hc

df = pd.DataFrame({'col1': [7, 8, 9], 'col2': [10, 11, 12]})
hc.append_table(df, table_name='test_table')

As with create_table_from_df, a filename can be provided as well. If one is not, a name is generated based on a timestamp. However, if writing anywhere other than the experimental zone, a specified filename is required.

Table Describing

honeycomb can be used to obtain information on tables in the lake, such as column names and dtypes, and if include_metadata is set to true, a table description and column comments.

import honeycomb as hc

hc.describe_table('test_table', schema='curated', include_metadata=True)

Session-Level Configuration

honeycomb outputs certain messages to the screen to help interactive users maintain awareness of what is being performed behind-the-scenes. If this is not desirable (as may be the case for notebooks, pipelines, usage of honeycomb within other packages, etc.), all non-logging output can be disabled with hc.set_option('verbose', False). This also sets the corresponding verbose option in rivet to False.