Jerome Banks edited this page Mar 14, 2018 · 1 revision

XUnit UDFs

Brickhouse contains UDFs for parsing and manipulating XUnits. An XUnit is a string containing multiple YPaths, separated by commas. A YPath is a dimension name followed by one or more attributes (each with a name and a value), separated by slashes.

UDF Descriptions

  • append_ypath Append a YPath to a given XUnit.

  • get_all_yp_dims Return all the YPath dimensions of an XUnit as an array

  • get_num_yp_dims Return the number of YPaths in an XUnit. The Global XUnit returns 0.

  • get_ypath_attribute Return an attribute value from an XUnit, specifying the XUnit, the YPath dimension and the attribute name.

  • get_ypath_struct Return a YPath named_struct representation, given an XUnit and a YPath dimension.

  • is_global_xunit Returns true if the XUnit matches the Global XUnit /G.

  • parse_xunit_string Parse a string representation of an XUnit and return a named_struct

  • parse_ypath_string Parse a string representation of a YPath, and return a named_struct

  • print_xunit_struct Print the string representation of an XUnit, given a named_struct

  • print_ypath_struct Print the string representation of a YPath, given a named_struct

  • remove_yp_dim Return an XUnit with the specified YPath dimension removed. If the XUnit doesn't contain that YPath dimension, the same XUnit is returned.

  • contains_only_yp_dims Given an XUnit and an array of YPath dimensions, return true if and only if the XUnit contains exactly the specified dimensions. If the XUnit doesn't contain all the dimensions, this returns false. If the XUnit contains additional dimensions, this also returns false.

  • contains_yp_dim Given an XUnit and a YPath dimension, return true if the XUnit contains the dimension. If other dimensions are in the XUnit, this still returns true.

  • contains_ypath Given an XUnit and a YPath, return true if the XUnit contains the YPath, including the attribute names and values. If the XUnit contains a YPath with the same dimension but the attribute names or values differ, this returns false. If the YPath contains additional attributes, this also returns false.

  • construct_ypath Constructs a YPath named_struct, given a YPath dimension, an array of attribute names, and an array of attribute values.

  • construct_xunit Construct an XUnit from an array of YPath named_structs. If the size of the array is 0, the Global XUnit is returned.
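
As a rough illustration of how several of these UDFs can be combined in one query, here is a minimal sketch. It assumes the Brickhouse XUnit UDFs are already registered in the Hive session and that a table funnel_aggregates with a string column xunit exists (as in the Examples section). The argument order for append_ypath and remove_yp_dim (XUnit first) and the '/region/country=US' YPath are assumptions made for illustration.

```sql
-- Sketch only: assumes the Brickhouse XUnit UDFs are registered in this session.
-- Argument order for append_ypath and remove_yp_dim (XUnit first) is an assumption.
SELECT
  xunit,
  get_num_yp_dims( xunit )                     AS num_dims,       -- 0 for the Global XUnit
  get_all_yp_dims( xunit )                     AS dims,           -- array of dimension names
  is_global_xunit( xunit )                     AS is_global,
  contains_yp_dim( xunit, 'brand' )            AS has_brand,
  remove_yp_dim( xunit, 'brand' )              AS without_brand,
  append_ypath( xunit, '/region/country=US' )  AS with_region     -- hypothetical YPath
FROM funnel_aggregates
LIMIT 10;
```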

Named_struct and String representations of XUnits and YPaths

To allow for easier manipulation, this library supports two representations of XUnits and YPaths: a standard string (e.g. '/brand/brand=Adidas') and a Hive named_struct. Once the XUnit is represented as a named_struct, the individual fields can be accessed as struct fields and passed to other UDFs in Hive and Brickhouse. (For example, one could do a LATERAL VIEW EXPLODE on the array of YPaths in an XUnit to access all the attributes.)

The type for a YPath is

  struct<dimension:string,
         attributes:array<
             struct<attribute_name:string,attribute_value:string>>>

The type for an XUnit is

  struct<is_global:boolean,
         ypaths:array<
             struct<dimension:string,
                    attributes:array<
                        struct<attribute_name:string,attribute_value:string>>>>>

The XUnit is essentially an array of YPaths, plus a boolean global flag.
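
To make the struct layout concrete, the following sketch reads the fields of a parsed XUnit directly. It assumes the field names given in the type definitions above and reuses the funnel_aggregates table from the Examples section; the aliases parsed, yp_view and yp are illustrative names only.

```sql
-- Sketch only: field names (is_global, ypaths, dimension, attributes)
-- are taken from the type definitions above.
SELECT
  xunit,
  parsed.is_global       AS is_global,
  size( parsed.ypaths )  AS num_ypaths,
  yp.dimension           AS dimension,
  size( yp.attributes )  AS num_attributes
FROM (
  SELECT xunit, parse_xunit_string( xunit ) AS parsed
  FROM funnel_aggregates
) t
LATERAL VIEW EXPLODE( parsed.ypaths ) yp_view AS yp;
```

Note that a plain LATERAL VIEW produces no output for rows whose exploded array is empty; LATERAL VIEW OUTER keeps such rows, with a null yp.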

Arguments and return values

When a UDF specifies that an XUnit or a YPath should be passed in, either a string or a named_struct can be passed. When a function returns an XUnit or YPath, the named_struct representation is generally returned. To translate between these representations, use the print_xunit_struct and parse_xunit_string UDFs.
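
A minimal sketch of moving between the two representations might look like the following. It builds a YPath and an XUnit from their parts and prints both back to strings; the dimension and attribute values are taken from the '/brand/brand=Adidas' example, and SELECT without a FROM clause assumes Hive 0.13 or later.

```sql
-- Sketch only: construct named_structs, then print them back as strings.
SELECT
  print_ypath_struct(
    construct_ypath( 'brand', array( 'brand' ), array( 'Adidas' ) )
  ) AS ypath_string,
  print_xunit_struct(
    construct_xunit(
      array( construct_ypath( 'brand', array( 'brand' ), array( 'Adidas' ) ) )
    )
  ) AS xunit_string,
  -- Going the other way: string in, named_struct out.
  parse_xunit_string( '/brand/brand=Adidas' ) AS xunit_struct;
```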

Invalid XUnits

When an XUnit cannot be parsed, a HiveException is thrown, which generally causes the Hive query to fail. This enforces that only valid XUnits are processed. If a dataset might contain invalid XUnits, use the is_valid_xunit UDF to filter out the offending rows. If null is passed in as an argument, a null value is returned.
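
One way to guard a query against malformed input is sketched below. It assumes is_valid_xunit takes the XUnit string as its single argument and returns a boolean, which is not spelled out above; the inner query filters first so that the other XUnit UDFs are only applied to rows that passed the check.

```sql
-- Sketch only: the single-argument form of is_valid_xunit is an assumption.
SELECT
  xunit,
  get_num_yp_dims( xunit ) AS num_dims
FROM (
  SELECT xunit
  FROM funnel_aggregates
  WHERE is_valid_xunit( xunit )
) valid_rows;
```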

Examples

These UDFs are intended to make it easier to analyze data sets that are keyed by XUnits. For example, imagine a dataset with "brand" as one of the YPaths.

A view that deals only with top-level brand aggregates might look like the following:

CREATE VIEW brand_view
AS
SELECT xunit,
   get_ypath_attribute( xunit, "brand", "brand" ) as brand,
   get_ypath_struct( xunit, "brand" ) as brand_ypath,
   interaction_count,
   conversion_count,
   as_of
FROM
  funnel_aggregates
WHERE
   contains_only_yp_dims( xunit, array("brand" ) );

This selects only the rows whose XUnit consists solely of a top-level /brand YPath, using the contains_only_yp_dims UDF. One can then get the brand name via the get_ypath_attribute UDF, or the full YPath via get_ypath_struct if the actual YPath is needed.

Also, remember that you can use LATERAL VIEW EXPLODE to access array elements more conveniently. Say you had a YPath with multiple layers of attributes (like a category tree). One could expand the YPath's attributes with a query like the following:

DROP VIEW IF EXISTS category_explode_view;
CREATE VIEW category_explode_view
AS
SELECT
   xunit,
   get_ypath_struct( xunit, "category" ) as category_ypath,
   yp.attribute_name,
   yp.attribute_value,
   interaction_count,
   conversion_count,
   as_of
FROM  funnel_aggregates
LATERAL VIEW EXPLODE( get_ypath_struct( xunit, "category" ).attributes ) yp1 AS yp
WHERE
    contains_yp_dim( xunit, "category" )
;

This way you could examine some of the leaf nodes more easily:

SELECT *
FROM category_explode_view
WHERE attribute_value like '%Books%';
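
When the exact YPath of interest is known in advance, contains_ypath can filter the base table directly. The following is a sketch under the assumption that funnel_aggregates has the columns used in the views above; the Adidas YPath is the illustrative value from earlier in this page, passed as a string since either representation is accepted.

```sql
-- Sketch only: totals per as_of for rows whose XUnit contains the given YPath.
SELECT
  as_of,
  SUM( interaction_count ) AS total_interactions,
  SUM( conversion_count )  AS total_conversions
FROM funnel_aggregates
WHERE contains_ypath( xunit, '/brand/brand=Adidas' )
GROUP BY as_of;
```

Unlike brand_view, this also matches rows whose XUnit carries other YPaths in addition to the brand.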