Skip to content

Commit

Permalink
Fix SourceScanOptimizer generating duplicated columns (#1494)
Browse files Browse the repository at this point in the history
## Summary

This PR fixes the issue with `SourceScanOptimizer` generating duplicated
columns. The problem was happening because we didn't check for the same
alias referring to different things before merging nodes.

The fix for it was pretty simple, and I added a few tests to catch this
issue.

I recommend reviewing commit by commit :)
  • Loading branch information
serramatutu authored Nov 5, 2024
1 parent bfcc13b commit 44ba096
Show file tree
Hide file tree
Showing 27 changed files with 4,135 additions and 0 deletions.
6 changes: 6 additions & 0 deletions .changes/unreleased/Fixes-20241030-201214.yaml
Original file line number Diff line number Diff line change
@@ -0,0 +1,6 @@
kind: Fixes
body: Prevent SourceScanOptimizer from combining nodes that use the same input metric alias in different derived metrics
time: 2024-10-30T20:12:14.516814+01:00
custom:
Author: serramatutu
Issue: "1494"
Original file line number Diff line number Diff line change
Expand Up @@ -817,3 +817,36 @@ metric:
name: listings
filter: "{{ Metric('views', ['listing']) }} > 10"
time_granularity: week
---
# note that in both 1a and 1b "shared_alias" refer to "bookings", whereas
# in 2 it refers to "instant_bookings", hence the naming difference
metric:
name: derived_shared_alias_1a
description: "Minimal repro case for derived metrics which give the same alias for different underlying metrics"
type: derived
type_params:
expr: shared_alias - 10
metrics:
- name: bookings
alias: shared_alias
---
metric:
name: derived_shared_alias_1b
description: "Minimal repro case for derived metrics which give the same alias for different underlying metrics"
type: derived
type_params:
expr: shared_alias - 100
metrics:
- name: bookings
alias: shared_alias
---
metric:
name: derived_shared_alias_2
description: "Minimal repro case for derived metrics which give the same alias for different underlying metrics"
type: derived
type_params:
expr: shared_alias + 10
metrics:
- name: instant_bookings
alias: shared_alias
---
Original file line number Diff line number Diff line change
Expand Up @@ -28,6 +28,9 @@
'listing__booking__listing__bookings_per_dollar',
'listing__booking__listing__derived_bookings_0',
'listing__booking__listing__derived_bookings_1',
'listing__booking__listing__derived_shared_alias_1a',
'listing__booking__listing__derived_shared_alias_1b',
'listing__booking__listing__derived_shared_alias_2',
'listing__booking__listing__discrete_booking_value_p99',
'listing__booking__listing__double_counted_delayed_bookings',
'listing__booking__listing__instant_booking_fraction_of_max_value',
Expand Down Expand Up @@ -90,6 +93,9 @@
'listing__created_at__year',
'listing__derived_bookings_0',
'listing__derived_bookings_1',
'listing__derived_shared_alias_1a',
'listing__derived_shared_alias_1b',
'listing__derived_shared_alias_2',
'listing__discrete_booking_value_p99',
'listing__double_counted_delayed_bookings',
'listing__ds__day',
Expand Down Expand Up @@ -441,6 +447,9 @@
'user__listing__user__bookings_per_view',
'user__listing__user__derived_bookings_0',
'user__listing__user__derived_bookings_1',
'user__listing__user__derived_shared_alias_1a',
'user__listing__user__derived_shared_alias_1b',
'user__listing__user__derived_shared_alias_2',
'user__listing__user__discrete_booking_value_p99',
'user__listing__user__double_counted_delayed_bookings',
'user__listing__user__instant_booking_fraction_of_max_value',
Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -45,6 +45,9 @@ Model Join-Path Entity Links
('listings_latest',) ("('listing',)", "('booking', 'listing')") bookings_per_dollar ['JOINED', 'METRIC']
('listings_latest',) ("('listing',)", "('booking', 'listing')") derived_bookings_0 ['JOINED', 'METRIC']
('listings_latest',) ("('listing',)", "('booking', 'listing')") derived_bookings_1 ['JOINED', 'METRIC']
('listings_latest',) ("('listing',)", "('booking', 'listing')") derived_shared_alias_1a ['JOINED', 'METRIC']
('listings_latest',) ("('listing',)", "('booking', 'listing')") derived_shared_alias_1b ['JOINED', 'METRIC']
('listings_latest',) ("('listing',)", "('booking', 'listing')") derived_shared_alias_2 ['JOINED', 'METRIC']
('listings_latest',) ("('listing',)", "('booking', 'listing')") discrete_booking_value_p99 ['JOINED', 'METRIC']
('listings_latest',) ("('listing',)", "('booking', 'listing')") double_counted_delayed_bookings ['JOINED', 'METRIC']
('listings_latest',) ("('listing',)", "('booking', 'listing')") instant_booking_fraction_of_max_value ['JOINED', 'METRIC']
Expand Down Expand Up @@ -94,6 +97,9 @@ Model Join-Path Entity Links
('listings_latest',) ("('listing',)", "('listing',)") bookings_per_view ['JOINED', 'METRIC']
('listings_latest',) ("('listing',)", "('listing',)") derived_bookings_0 ['JOINED', 'METRIC']
('listings_latest',) ("('listing',)", "('listing',)") derived_bookings_1 ['JOINED', 'METRIC']
('listings_latest',) ("('listing',)", "('listing',)") derived_shared_alias_1a ['JOINED', 'METRIC']
('listings_latest',) ("('listing',)", "('listing',)") derived_shared_alias_1b ['JOINED', 'METRIC']
('listings_latest',) ("('listing',)", "('listing',)") derived_shared_alias_2 ['JOINED', 'METRIC']
('listings_latest',) ("('listing',)", "('listing',)") discrete_booking_value_p99 ['JOINED', 'METRIC']
('listings_latest',) ("('listing',)", "('listing',)") double_counted_delayed_bookings ['JOINED', 'METRIC']
('listings_latest',) ("('listing',)", "('listing',)") instant_booking_fraction_of_max_value ['JOINED', 'METRIC']
Expand Down Expand Up @@ -156,6 +162,9 @@ Model Join-Path Entity Links
('listings_latest',) ("('user',)", "('listing', 'user')") bookings_per_view ['JOINED', 'METRIC']
('listings_latest',) ("('user',)", "('listing', 'user')") derived_bookings_0 ['JOINED', 'METRIC']
('listings_latest',) ("('user',)", "('listing', 'user')") derived_bookings_1 ['JOINED', 'METRIC']
('listings_latest',) ("('user',)", "('listing', 'user')") derived_shared_alias_1a ['JOINED', 'METRIC']
('listings_latest',) ("('user',)", "('listing', 'user')") derived_shared_alias_1b ['JOINED', 'METRIC']
('listings_latest',) ("('user',)", "('listing', 'user')") derived_shared_alias_2 ['JOINED', 'METRIC']
('listings_latest',) ("('user',)", "('listing', 'user')") discrete_booking_value_p99 ['JOINED', 'METRIC']
('listings_latest',) ("('user',)", "('listing', 'user')") double_counted_delayed_bookings ['JOINED', 'METRIC']
('listings_latest',) ("('user',)", "('listing', 'user')") instant_booking_fraction_of_max_value ['JOINED', 'METRIC']
Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -76,6 +76,9 @@
'company__listing__user__company__bookings_per_view',
'company__listing__user__company__derived_bookings_0',
'company__listing__user__company__derived_bookings_1',
'company__listing__user__company__derived_shared_alias_1a',
'company__listing__user__company__derived_shared_alias_1b',
'company__listing__user__company__derived_shared_alias_2',
'company__listing__user__company__discrete_booking_value_p99',
'company__listing__user__company__double_counted_delayed_bookings',
'company__listing__user__company__instant_booking_fraction_of_max_value',
Expand Down Expand Up @@ -151,6 +154,9 @@
'guest__booking__guest__bookings_per_dollar',
'guest__booking__guest__derived_bookings_0',
'guest__booking__guest__derived_bookings_1',
'guest__booking__guest__derived_shared_alias_1a',
'guest__booking__guest__derived_shared_alias_1b',
'guest__booking__guest__derived_shared_alias_2',
'guest__booking__guest__discrete_booking_value_p99',
'guest__booking__guest__double_counted_delayed_bookings',
'guest__booking__guest__instant_booking_fraction_of_max_value',
Expand Down Expand Up @@ -186,6 +192,9 @@
'guest__bookings_per_dollar',
'guest__derived_bookings_0',
'guest__derived_bookings_1',
'guest__derived_shared_alias_1a',
'guest__derived_shared_alias_1b',
'guest__derived_shared_alias_2',
'guest__discrete_booking_value_p99',
'guest__double_counted_delayed_bookings',
'guest__instant_booking_fraction_of_max_value',
Expand Down Expand Up @@ -232,6 +241,9 @@
'host__booking__host__bookings_per_dollar',
'host__booking__host__derived_bookings_0',
'host__booking__host__derived_bookings_1',
'host__booking__host__derived_shared_alias_1a',
'host__booking__host__derived_shared_alias_1b',
'host__booking__host__derived_shared_alias_2',
'host__booking__host__discrete_booking_value_p99',
'host__booking__host__double_counted_delayed_bookings',
'host__booking__host__instant_booking_fraction_of_max_value',
Expand Down Expand Up @@ -267,6 +279,9 @@
'host__bookings_per_dollar',
'host__derived_bookings_0',
'host__derived_bookings_1',
'host__derived_shared_alias_1a',
'host__derived_shared_alias_1b',
'host__derived_shared_alias_2',
'host__discrete_booking_value_p99',
'host__double_counted_delayed_bookings',
'host__instant_booking_fraction_of_max_value',
Expand Down Expand Up @@ -314,6 +329,9 @@
'listing__booking__listing__bookings_per_dollar',
'listing__booking__listing__derived_bookings_0',
'listing__booking__listing__derived_bookings_1',
'listing__booking__listing__derived_shared_alias_1a',
'listing__booking__listing__derived_shared_alias_1b',
'listing__booking__listing__derived_shared_alias_2',
'listing__booking__listing__discrete_booking_value_p99',
'listing__booking__listing__double_counted_delayed_bookings',
'listing__booking__listing__instant_booking_fraction_of_max_value',
Expand Down Expand Up @@ -362,6 +380,9 @@
'listing__created_at__extract_year',
'listing__derived_bookings_0',
'listing__derived_bookings_1',
'listing__derived_shared_alias_1a',
'listing__derived_shared_alias_1b',
'listing__derived_shared_alias_2',
'listing__discrete_booking_value_p99',
'listing__double_counted_delayed_bookings',
'listing__ds__day',
Expand Down Expand Up @@ -426,6 +447,9 @@
'lux_listing__listing__lux_listing__bookings_per_view',
'lux_listing__listing__lux_listing__derived_bookings_0',
'lux_listing__listing__lux_listing__derived_bookings_1',
'lux_listing__listing__lux_listing__derived_shared_alias_1a',
'lux_listing__listing__lux_listing__derived_shared_alias_1b',
'lux_listing__listing__lux_listing__derived_shared_alias_2',
'lux_listing__listing__lux_listing__discrete_booking_value_p99',
'lux_listing__listing__lux_listing__double_counted_delayed_bookings',
'lux_listing__listing__lux_listing__instant_booking_fraction_of_max_value',
Expand Down Expand Up @@ -571,6 +595,9 @@
'user__listing__user__bookings_per_view',
'user__listing__user__derived_bookings_0',
'user__listing__user__derived_bookings_1',
'user__listing__user__derived_shared_alias_1a',
'user__listing__user__derived_shared_alias_1b',
'user__listing__user__derived_shared_alias_2',
'user__listing__user__discrete_booking_value_p99',
'user__listing__user__double_counted_delayed_bookings',
'user__listing__user__instant_booking_fraction_of_max_value',
Expand Down
14 changes: 14 additions & 0 deletions metricflow/dataflow/nodes/compute_metrics.py
Original file line number Diff line number Diff line change
Expand Up @@ -31,6 +31,7 @@ class ComputeMetricsNode(DataflowPlanNode):

def __post_init__(self) -> None: # noqa: D105
super().__post_init__()

assert len(self.parent_nodes) == 1

@staticmethod
Expand Down Expand Up @@ -97,6 +98,19 @@ def can_combine(self, other_node: ComputeMetricsNode) -> Tuple[bool, str]:
if other_node.for_group_by_source_node != self.for_group_by_source_node:
return False, "one node is a group by metric source node"

alias_to_metric_spec = {spec.alias: spec for spec in self.metric_specs if spec.alias is not None}

for spec in other_node.metric_specs:
if (
spec.alias is not None
and spec.alias in alias_to_metric_spec
and alias_to_metric_spec[spec.alias] != spec
):
return (
False,
f"Alias '{spec.alias}' is defined in both nodes but it refers to different things in each of them",
)

return True, ""

def with_new_parents(self, new_parent_nodes: Sequence[DataflowPlanNode]) -> ComputeMetricsNode: # noqa: D102
Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -333,6 +333,60 @@ def test_derived_metric_with_non_derived_metric(
)


@pytest.mark.sql_engine_snapshot
def test_derived_metric_same_alias_components_combined(
request: FixtureRequest,
mf_test_configuration: MetricFlowTestConfiguration,
dataflow_plan_builder: DataflowPlanBuilder,
) -> None:
"""Tests optimization of querying 2 metrics which give the same alias to the same thing in their components.
In this case we DO combine source nodes, since the components are the same exact thing so we don't need to
scan over it twice
"""
check_optimization(
request=request,
mf_test_configuration=mf_test_configuration,
dataflow_plan_builder=dataflow_plan_builder,
query_spec=MetricFlowQuerySpec(
metric_specs=(
MetricSpec(element_name="derived_shared_alias_1a"),
MetricSpec(element_name="derived_shared_alias_1b"),
),
dimension_specs=(DimensionSpec(element_name="is_instant", entity_links=(EntityReference("booking"),)),),
),
expected_num_sources_in_unoptimized=2,
expected_num_sources_in_optimized=1,
)


@pytest.mark.sql_engine_snapshot
def test_derived_metric_same_alias_components_not_combined(
request: FixtureRequest,
mf_test_configuration: MetricFlowTestConfiguration,
dataflow_plan_builder: DataflowPlanBuilder,
) -> None:
"""Tests optimization of querying 2 metrics which give the same alias different things in their components.
In this case we should NOT combine source nodes, since this would generate two columns with
the same alias.
"""
check_optimization(
request=request,
mf_test_configuration=mf_test_configuration,
dataflow_plan_builder=dataflow_plan_builder,
query_spec=MetricFlowQuerySpec(
metric_specs=(
MetricSpec(element_name="derived_shared_alias_1a"),
MetricSpec(element_name="derived_shared_alias_2"),
),
dimension_specs=(DimensionSpec(element_name="is_instant", entity_links=(EntityReference("booking"),)),),
),
expected_num_sources_in_unoptimized=2,
expected_num_sources_in_optimized=2,
)


@pytest.mark.sql_engine_snapshot
def test_2_ratio_metrics_from_1_semantic_model(
request: FixtureRequest,
Expand Down
28 changes: 28 additions & 0 deletions tests_metricflow/integration/test_cases/itest_metrics.yaml
Original file line number Diff line number Diff line change
Expand Up @@ -551,6 +551,34 @@ integration_test:
ds
) a
---
integration_test:
name: shared_alias_derived_metric_same
description: Tests derived metrics which use the same alias for the same thing
model: SIMPLE_MODEL
metrics: ["derived_shared_alias_1a", "derived_shared_alias_1b"]
group_bys: ["booking__is_instant"]
check_query: |
SELECT
is_instant AS booking__is_instant
, COUNT(*) - 10 AS derived_shared_alias_1a
, COUNT(*) - 100 AS derived_shared_alias_1b
FROM {{ source_schema }}.fct_bookings
GROUP BY 1
---
integration_test:
name: shared_alias_derived_metric_different
description: Tests derived metrics which use the same alias for different things
model: SIMPLE_MODEL
metrics: ["derived_shared_alias_1a", "derived_shared_alias_2"]
group_bys: ["booking__is_instant"]
check_query: |
SELECT
is_instant AS booking__is_instant
, COUNT(*) - 10 AS derived_shared_alias_1a
, SUM(CASE WHEN is_instant THEN 1 ELSE 0 END) + 10 AS derived_shared_alias_2
FROM {{ source_schema }}.fct_bookings
GROUP BY 1
---
integration_test:
name: two_metrics_with_null_dimension_values
description: Tests querying two metrics with a dimension having a NULL values
Expand Down
26 changes: 26 additions & 0 deletions tests_metricflow/query_rendering/test_derived_metric_rendering.py
Original file line number Diff line number Diff line change
Expand Up @@ -795,3 +795,29 @@ def test_offset_to_grain_metric_filter_and_query_have_different_granularities(
dataflow_plan_builder=dataflow_plan_builder,
query_spec=query_spec,
)


@pytest.mark.sql_engine_snapshot
def test_derived_metric_that_defines_the_same_alias_in_different_components(
request: FixtureRequest,
mf_test_configuration: MetricFlowTestConfiguration,
dataflow_plan_builder: DataflowPlanBuilder,
query_parser: MetricFlowQueryParser,
sql_client: SqlClient,
dataflow_to_sql_converter: DataflowToSqlQueryPlanConverter,
) -> None:
"""Tests querying a derived metric which give the same alias to its components."""
query_spec = query_parser.parse_and_validate_query(
metric_names=("derived_shared_alias_1a", "derived_shared_alias_2"),
group_by_names=("booking__is_instant",),
limit=1,
).query_spec

render_and_check(
request=request,
mf_test_configuration=mf_test_configuration,
dataflow_to_sql_converter=dataflow_to_sql_converter,
sql_client=sql_client,
dataflow_plan_builder=dataflow_plan_builder,
query_spec=query_spec,
)
Loading

0 comments on commit 44ba096

Please sign in to comment.