Skip to content

Latest commit

 

History

History
258 lines (198 loc) · 8.11 KB

hll_functions.md

File metadata and controls

258 lines (198 loc) · 8.11 KB

HyperLogLog++ functions

The HyperLogLog++ algorithm (HLL++) estimates cardinality from sketches. If you do not want to work with sketches and do not need customized precision, consider using approximate aggregate functions with system-defined precision.

HLL++ functions are approximate aggregate functions. Approximate aggregation typically requires less memory than exact aggregation functions, like COUNT(DISTINCT), but also introduces statistical uncertainty. This makes HLL++ functions appropriate for large data streams for which linear memory usage is impractical, as well as for data that is already approximate.

ZetaSQL supports the following HLL++ functions:

HLL_COUNT.INIT

HLL_COUNT.INIT(input [, precision])

Description

An aggregate function that takes one or more input values and aggregates them into a HLL++ sketch. Each sketch is represented using the BYTES data type. You can then merge sketches using HLL_COUNT.MERGE or HLL_COUNT.MERGE_PARTIAL. If no merging is needed, you can extract the final count of distinct values from the sketch using HLL_COUNT.EXTRACT.

This function supports an optional parameter, precision. This parameter defines the accuracy of the estimate at the cost of additional memory required to process the sketches or store them on disk. The following table shows the allowed precision values, the maximum sketch size per group, and confidence interval (CI) of typical precisions:

Precision Max. Sketch Size (KiB) 65% CI 95% CI 99% CI
10 1 ±3.25% ±6.50% ±9.75%
11 2 ±2.30% ±4.60% ±6.89%
12 4 ±1.63% ±3.25% ±4.88%
13 8 ±1.15% ±2.30% ±3.45%
14 16 ±0.81% ±1.63% ±2.44%
15 (default) 32 ±0.57% ±1.15% ±1.72%
16 64 ±0.41% ±0.81% ±1.22%
17 128 ±0.29% ±0.57% ±0.86%
18 256 ±0.20% ±0.41% ±0.61%
19 512 ±0.14% ±0.29% ±0.43%
20 1024 ±0.10% ±0.20% ±0.30%
21 2048 ±0.07% ±0.14% ±0.22%
22 4096 ±0.05% ±0.10% ±0.15%
23 8192 ±0.04% ±0.07% ±0.11%
24 16384 ±0.03% ±0.05% ±0.08%

If the input is NULL, this function returns NULL.

For more information, see HyperLogLog in Practice: Algorithmic Engineering of a State of The Art Cardinality Estimation Algorithm.

Supported input types

INT64, UINT64, NUMERIC, BIGNUMERIC, STRING, BYTES

Return type

BYTES

Example

SELECT
  HLL_COUNT.INIT(respondent) AS respondents_hll,
  flavor,
  country
FROM UNNEST([
  STRUCT(1 AS respondent, "Vanilla" AS flavor, "CH" AS country),
  (1, "Chocolate", "CH"),
  (2, "Chocolate", "US"),
  (2, "Strawberry", "US")])
GROUP BY flavor, country;

HLL_COUNT.MERGE

HLL_COUNT.MERGE(sketch)

Description

An aggregate function that returns the cardinality of several HLL++ set sketches by computing their union.

Each sketch must have the same precision and be initialized on the same type. Attempts to merge sketches with different precisions or for different types results in an error. For example, you cannot merge a sketch initialized from INT64 data with one initialized from STRING data.

This function ignores NULL values when merging sketches. If the merge happens over zero rows or only over NULL values, the function returns 0.

Supported input types

BYTES

Return type

INT64

Example

SELECT HLL_COUNT.MERGE(respondents_hll) AS num_respondents, flavor
FROM (
  SELECT
    HLL_COUNT.INIT(respondent) AS respondents_hll,
    flavor,
    country
  FROM UNNEST([
    STRUCT(1 AS respondent, "Vanilla" AS flavor, "CH" AS country),
    (1, "Chocolate", "CH"),
    (2, "Chocolate", "US"),
    (2, "Strawberry", "US")])
  GROUP BY flavor, country)
GROUP BY flavor;

HLL_COUNT.MERGE_PARTIAL

HLL_COUNT.MERGE_PARTIAL(sketch)

Description

An aggregate function that takes one or more HLL++ sketch inputs and merges them into a new sketch.

This function returns NULL if there is no input or all inputs are NULL.

Supported input types

BYTES

Return type

BYTES

Example

SELECT HLL_COUNT.MERGE_PARTIAL(respondents_hll) AS num_respondents, flavor
FROM (
  SELECT
    HLL_COUNT.INIT(respondent) AS respondents_hll,
    flavor,
    country
  FROM UNNEST([
    STRUCT(1 AS respondent, "Vanilla" AS flavor, "CH" AS country),
    (1, "Chocolate", "CH"),
    (2, "Chocolate", "US"),
    (2, "Strawberry", "US")])
  GROUP BY flavor, country)
GROUP BY flavor;

HLL_COUNT.EXTRACT

HLL_COUNT.EXTRACT(sketch)

Description

A scalar function that extracts a cardinality estimate of a single HLL++ sketch.

If sketch is NULL, this function returns a cardinality estimate of 0.

Supported input types

BYTES

Return type

INT64

Example

SELECT
  flavor,
  country,
  HLL_COUNT.EXTRACT(respondents_hll) AS num_respondents
FROM (
  SELECT
    HLL_COUNT.INIT(respondent) AS respondents_hll,
    flavor,
    country
  FROM UNNEST([
    STRUCT(1 AS respondent, "Vanilla" AS flavor, "CH" AS country),
    (1, "Chocolate", "CH"),
    (2, "Chocolate", "US"),
    (2, "Strawberry", "US")])
  GROUP BY flavor, country);

+------------+---------+-----------------+
| flavor     | country | num_respondents |
+------------+---------+-----------------+
| Vanilla    | CH      | 1               |
| Chocolate  | CH      | 1               |
| Chocolate  | US      | 1               |
| Strawberry | US      | 1               |
+------------+---------+-----------------+

About the HLL++ algorithm

The HLL++ algorithm improves on the HLL algorithm by more accurately estimating very small and large cardinalities. The HLL++ algorithm includes a 64-bit hash function, sparse representation to reduce memory requirements for small cardinality estimates, and empirical bias correction for small cardinality estimates.

About sketches

A sketch is a summary of a large data stream. You can extract statistics from a sketch to estimate particular statistics of the original data, or merge sketches to summarize multiple data streams. A sketch has these features:

  • It compresses raw data into a fixed-memory representation.
  • It's asymptotically smaller than the input.
  • It's the serialized form of an in-memory, sublinear data structure.
  • It typically requires less memory than the input used to create it.

Sketches allow integration with other systems. For example, it is possible to build sketches in external applications, like Cloud Dataflow, or Apache Spark and consume them in ZetaSQL or vice versa. Sketches also allow building intermediate aggregations for non-additive functions like COUNT(DISTINCT).