Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SL-1290] Conversion Metrics #252

Closed
nhandel opened this issue Sep 19, 2022 · 16 comments · Fixed by #916
Closed

[SL-1290] Conversion Metrics #252

nhandel opened this issue Sep 19, 2022 · 16 comments · Fixed by #916
Assignees
Labels
backlog enhancement New feature or request linear multiple_requests Used when multiple users have asked for a feature/enhancement

Comments

@nhandel
Copy link
Collaborator

nhandel commented Sep 19, 2022

Describe the Feature
A new conversion metric type would be useful. A conversion is when a base event happens and then a conversion event happens for a specific entity within some time range. For example, we might want to measure how often a user(entity) completes a visit(base event) and then purchase(conversion event) with 7 days(time window). This requires a time range and an entity to join on. Today the closest alternative is a ratio metric but that does not include the entity in the pre-aggregated join.

This could be defined using the following spec:

metric:
  name: visit_to_purchase_7d
  owners:
    - [email protected]
  type: conversion
  type_params:
    entity: user
    base_data_source: visits # required
    base_params: # all optional
      time_dimension: created_at # default: primary time
    conversion_data_source: transactions # required
    conversion_params: # all optional
      time_dimension: created_at # default: primary time
      constraint: status = 'completed' # default: none
      role: buyer # default: none. This is needed for disambiguation when there are two roles for the same entity.
    window: 7d # default: inf. window to join the two events on. 
    multiple_conversions: false # default: false. Is this true/false or a count of how many conversions happen? default: false

would yield SQL that doesn't look like this but accomplishes the same thing:

SELECT
  DATE_TRUNC(visits.created_at, 'day') AS created_at
  COUNT(distinct transaction_id) / COUNT(visit_id) AS conversion_rate
FROM
  visits
LEFT JOIN
  transactions
ON
  visits.created_at - transactions.created_at BETWEEN 0 AND 7 days
  AND visits.user_id = transactions.user_id
WHERE
  transactions.status = 'completed'

A few notes about query parameters:

  • [NEEDS DISCUSSION] For time, you would use the base_events time as the primary time for this metric because that is when the conversion started. You can still request either time dimension in the API. Asking for both dimensions would allow you to create cohort metrics.
  • [NEEDS DISCUSSION] For mutability, because of the note above, a conversion metric cannot be treated as mutable until the end of it's window has passed. In the case of an infinite window, the metric is permanently mutable and should be expected to change.
  • For time over time, this would behave like any other metric. Metricflow would calculate those two conversion rates for the specified granularity and then take a ratio of the two time period.
  • For granularity, this would behave like any other metric. You would take the values across the entire window of time and then aggregated to that granularity.
  • For filters, this would behave like any other metric with multiple data sources (such as ratio metrics).
  • For grouping, this should apply to the base_event as that is the most common implementation to segment a conversion metric.

Would you like to contribute?
I'd love to contribute to design!

Anything Else?
Nope! This would be a stellar feature and metricFlow is designed to make it happen.

SL-1290

@nhandel nhandel added enhancement New feature or request triage Tasks that need to be triaged multiple_requests Used when multiple users have asked for a feature/enhancement and removed triage Tasks that need to be triaged labels Sep 19, 2022
@sid-patreon
Copy link

Hey @nhandel - Funnels can get a little more complex with multi-step funnels, duplicate events, measuring time to conversion, holding certain properties constant, etc. It may be worth looking into How Amplitude computes funnels as you start introducing this feature in metric flow

@lchircus
Copy link

Have you considered what time of time grains you'd support? In particular, seconds, minutes, and days seem most useful.

Also, how would you handle it if a user had multiple base or conversion events within the specified time window?

@nhandel
Copy link
Collaborator Author

nhandel commented Sep 22, 2022

Thanks for sending that over, Sid!
Here are some thoughts:

  • Multi-step funnels - This one is essential. You can always define multiple conversion events if you want to track multiple steps separately. If you want to filter a conversion metrics base event down to only users who completed the previous conversion in the funnel, then we would need to be able to specify a param in the base event that should be tied to another conversion metrics completion. That could look something like this:
      base_params: # all optional
        conversion_from: previous_metric_name
    
  • Multiple Base Events - This one is interesting. We could add a parameter to the spec that deduplicates events across some window and gives a strategy for which event to use (first/last). For example, you could specify that you want the first event in a 1 hour / 1 day / 7-day window. Are there any other strategies to think about here?
  • Multiple Conversion Events - I believe this is solved by the multiple_conversions param above. I may suggest moving this in the conversion_params and matching the names with the params for multiple base events.
  • Time Grain - It should be straightforward to support any time granularity data warehouses support. I'd imagine sub-second is unnecessary to probably seconds through years.
  • Grouping - I added a note above that we should apply segmentation/group by at the base event. That makes sense.
  • Simultaneous Events - Ensuring that if Event B happens in the same second, this is still considered a conversion Event A. This could be configurable. Depending on the logging system, I could imagine this needing to be even more than a second.
  • Duplicate Events - Since this is in the data warehouse, I expect people to do whatever deduplication they deem appropriate. If MetricFlow implemented some basic rules, it would need to be explicit to turn it on and nearly 100% accurate. They can also always use the multiple base event parameters described above.
  • Holding properties constant - This should be possible with constraints. Constraints can come from any data source and include more complicated SQL expressions of other dimensions and metrics.

Are there any other thoughts from the Amplitude doc or anything else we should think about?

@sid-patreon
Copy link

Re multi-step:
Yep was referring to having the flexibility to define optional dependency on prior events. Some of the possibilities listed were This order and Exact order.

Re duplicate events: both base and conversion
I do think this is a common occurrence so having some intentional params in the config to handle these will be good, such as

  • first (default for base event)
  • last (default for conversion event)
  • arbitrary (when it doesn't matter)

Re time grains
In addition to supporting various time grains, I think normalizing the time grains will be important. For e.g. 1 day = 24 hours as opposed to calendar days, similarly weeks and months so that it's agnostic to timezones and hard boundary cut off's (this I think is mostly an implementation detail)

Re holding properties constant
This is more for attribution purposes when a user can have multiple conversion events in a day. E.g. First touch attribution and things like session_id or page_id are probably examples of such properties.
In terms of implementation, similar to segments I think typically you'd use the property value on the base event to hold constant and you would add this to your grouping columns.

The other things I'd say that would be useful -

  • Measuring time to conversion
  • Counting users and frequency of conversion.
  • Having a time series for conversion

@nhandel
Copy link
Collaborator Author

nhandel commented Sep 27, 2022

Multi-step:
Perfect. I think being able to reference another conversion metric is a reasonable solution. Funnels seem like a higher level abstraction than a conversion metric, so I may punt on that for now. I could see us defining a set of events as a funnel and then programmatically building a collection of conversion metrics from that funnel.

duplicate events:
It makes sense to have some basic strategies here. I don't see why someone would override the defaults with arbitrary. That seems like it adds an element of randomness that I don't love for metrics. What is the purpose?

time grains
Definitely. MetricFlow is great at time granularities

holding properties constant
Very useful! Something like this?

type_params:
    entity: user
    base_data_source: visits # required
    base_params: # all optional
      time_dimension: created_at # default: primary time
      constant_properties:
        - name: session
           expr: session_id
    conversion_data_source: transactions # required
    conversion_params: # all optional
      time_dimension: created_at # default: primary time
      constant_properties:
        - name: session
           expr: id_session
    window: 7d # default: inf. window to join the two events on. 

Measuring time to conversion, Counting users and frequency of conversion
This leads me to believe there are enough similar but slightly different metrics to consider adding a metric_subtype. The options would be:

  • ratio [default] - ratio of converted events over base events
  • count - count of conversions
  • time - difference in timestamps (this would also need an agg function like min, max, avg, p(95), etc. )

This could look something like this:

metrics:
  ...
  type: conversion
  type_params: 
    subtype: time
    subtype_params:
      agg: avg
  ...

Having a time series for conversion
Is it correct that this is just the conversion metric by time?

This is turning into a pretty significant metric type!

@lchircus
Copy link

Yes to all the different subtypes! In particular, the time based ones with options like P95/min/max open a wide array of metrics like page performance, customer service SLAs, etc

@nhandel
Copy link
Collaborator Author

nhandel commented Sep 29, 2022

Okay! Thank you both for all the feedback! One more thing...

Retention
Another thing (😅) we should support is the ability to do the inverse of conversion (retention/churn) metrics using the same interface. For conversion ratios, you would probably want 1 - conversion_count/base_count. For conversion counts, you would probably want base_counts-conversion_counts. For conversion time, you would probably want agg_function(max(min(time_since_base_event, conversion_window), time_to_convert)). I don't think the inverse of multiple conversions makes sense, but I could be wrong. I think the parameter could be a boolean called retention (optional, defaults to false).

Querying
Currently, MetricFlow uses metric_time as an indicator of the metrics' primary time. Per the original post, there are now two potential values for metric_time: base times and conversion times We may want to add three additional fields to give more flexibility in requesting these values:

  • metric_time_base - the time of the base event (and the same as metric_time by default).
  • metric_time_conversion - the time of the conversion event.
  • metric_time_to_conversion - The difference between the base and conversion timestamps.

Here is the spec after all of those features:

metric:
  name: _string_ # required
  description: _string_ # optional
  owners: _list[emails]_ # optional
  type: conversion # required
  type_params:
    entity: _entity_ # required
    subtype: _time, count or ratio_ # optional. default: ratio
    subtype_params: # optional. default: None. None for  ratio or count
      agg: _agg_ # Support all agg types in [MetricFlow](https://docs.transformdata.io/docs/metricflow/reference/data-sources/measures#aggregation-required) 
    base_data_source: _data_source_ # required
    base_params: # optional. default None 
      conversion_from: _metric_ # optional. default: None Only for base params
      time_dimension: _time_dimension_ # optional. default: primary time of base data source
      constant_properties: # optional. default None 
        - name: _string_ # required: property alias must match a property in conversion_params
           expr: _sql_expr_ # required. expr in base data_source
      constraint:  _sql_expr_ # optional. default: none
      role: _entity_role_ # optional. default: none. This is needed for disambiguation when there are two roles for the same entity i.e. a user can be a buyer and seller
      dedupe: _first or last_ # optional. default: first. options are first or last but possibly more complicated approaches in the future
      dedupe_params:
        window: _time_ # optional. default: 1s.
    conversion_data_source: transactions # required
    conversion_params: # optional. default: None
      time_dimension: _time_dimension_ # optional. default: primary time of base data source
      constant_properties:
        - name: _string_ # required: property alias must match a property in conversion_params
           expr: _sql_expr_ # required. expr in base data_source
      constraint: _sql_expr_' # optional. default: none
      role: _entity_role_ # optional. default: none.
      dedupe: _first or last_ # optional. default: last. 
      dedupe_params:
        window: _time_ # optional. default: 1s.
    window: 7d # optional. default: inf. window to join the two events on. 
    retention: _boolean_ # optional. default: false.
    multiple_conversions: _boolean_ # optional. default: false. Is this true/false or a count of how many conversions happen? default: false

There is a ton of optionality in there which leads to a significant number of params but I would not expect these all to be used at the same time. As an example a simple conversion metric would look like this:

metric:
  name: signup_to_buy_7d
  description: Percent converting from signup to make a purchase within 7d
  owners: 
    - [email protected]
  type: conversion
  type_params:
    entity: user
    base_data_source: user_signups
    conversion_data_source: transactions
    conversion_params:
      constraint: is_complete
    window: 7d

And a much more complicated metric might look like:

metric:
  name: avg_time_to_buy_in_session
  description: Percent converting from signup to make a purchase within 7d
  owners: 
    - [email protected]
  type: conversion
  type_params:
    entity: user
    sub_type: time
    sub_type_params:
      agg: average 
    base_data_source: sessions
    base_params:
      time_dimension: session_start_at
      constant_properties:
        - name: session
           expr: session_id
       dedupe: first
       dedupe_params
           window: 24h
    conversion_data_source: transactions
    conversion_params:
      constraint: is_complete
      constant_properties:
        - name: session
           expr: session_id
    window: 28d # to avoid mutability after 28 days 

One small note on Funnels - @tlento and I had a chat about funnels, and we think conversion metrics are a lower-level abstraction. Ultimately this gives tons of control over what you build, but it may not be as simple and easy as just defining a funnel and programmatically building a set of metrics for that funnel. this puts us in a good place to add that down the line.

@lchircus
Copy link

lchircus commented Oct 4, 2022

metric_time_base makes a lot more sense to me than metric_time_conversion. For the latter, how would that be represented for the 2 examples above?

Also, I'm unclear - does this proposal supports percentile metrics? This is not as urgently needed on our end but necessary eventually.

@nhandel
Copy link
Collaborator Author

nhandel commented Oct 7, 2022

Okay! Here is what I'm thinking of in terms of ordering these features:

V1 - The Core

  • Converting Entity: entity
  • Subtypes: ratio
  • Data Sources: base_data_source, conversion_data_source
  • Conversion/Base Params: time_dimension, constraint, and role
  • Conversion Window: window
  • Multiple Conversions: multiple_conversions

V2 - Common Metric Variations

  • Subtypes: time and count
  • Retention: retention

V3 - Additional Control

  • Constant Properties: constant_properties
  • Conversion From: conversion_from
  • Dedupe Strategies: dedupe and dedupe_params

I added Percentiles separately! Here is a proposed design. Fairly simple. This could be a good first issue or a quick one for our team. #274.

@lchircus
Copy link

lchircus commented Oct 7, 2022

@nhandel The core makes sense; however, I think V2 and V3 should be swapped in priority because the ability to hold properties constant is very valuable. For example, a metric like 'New product with 1st purchase in 7 days' requires holding the product_id constant

@nhandel
Copy link
Collaborator Author

nhandel commented Oct 7, 2022

Noted! We can make that the next feature up after the core work then.

@joezein
Copy link

joezein commented Nov 15, 2022

updates here @nhandel ?

@joezein
Copy link

joezein commented Nov 15, 2022

would love to be able to calculate churn soon-ish and retention cohorts

@nhandel
Copy link
Collaborator Author

nhandel commented Nov 18, 2022

Based on where we're at, I think this PR will be up by tomorrow. There were a few tricky parts here that required a few more design conversations with @WilliamDee. Specifically, handling the foreign-foreign joins and the subsequent fan out/deduplication of events and adding support for window functions.

We decided that rather than pointing these configs directly at data sources and defining count measures in this metric, we would reuse the existing measure abstraction in data sources. This allows us to enable features we didn't expect to release in the initial version for free. It's also cleaner to introduce fewer new abstractions and reuse the existing measure construction logic. This keeps the spec a bit simpler.

So, our new proposal is that the conversion metric spec will look like this initially:

metric:
  name: _string_ # required
  description: _string_ # optional
  owners: _list[emails]_ # optional
  type: conversion # required
  type_params:
    entity: _entity_ # default: time_spine (similar to a ratio metric)
	calculation: _calculation_type_ # default: conversion_rate. options: conversions(buys) or conversion_rate (buys/visits)
    base_measure: _measure_ # required
    conversion_measure: _measure_ # required
    window: _time_window_ # optional. default: inf. window to join the two events on. Follows similar format as time windows elsewhere (i.e. 7d)
	constant_properties:
	  - name: _string_ # optional.
        base_expr: _dimension_ # optional. Even if name above is expressed. defaults to name above. solves for role also
    	conversion_expr: _dimension_ # same as base above

and then evolve into something close to this as we add more deduplication and calculation functionality:

metric:
  name: _string_ # required
  description: _string_ # optional. default: none
  owners: _list[emails]_ # optional. default: none
  type: conversion # required
  type_params:
    conversion_from: _metric_ # optional. default: none
    entity: _entity_ # default: time_spine (similar to a ratio metric)
	calculation: _calculation_type_ # default: conversion_rate. options: conversion(buys), retention(visits-buys), conversion_rate(buys/visits), retention_rate([visits-buys]/visits), time
    base_measure: _measure_ # required
    base_measure_params: # optional. default: none
      dedupe_strategy: #optional. Options: first, last, arbitrary
      dedupe_window: #optional.
    conversion_measure: _measure_ # required
    conversion_measure_params: # optional. default: none
      dedupe_strategy: #optional. tbd
      dedupe_window: #optional.
    window: _time_window_ # optional. default: inf. window to join the two events on. Follows similar format as time windows elsewhere (i.e. 7d)
	constant_properties:
	  - name: _string_ # optional.
	    base_expr: _dimension_ # optional. Even if name above is expressed. defaults to name above. solves for role also
		conversion_expr: _dimension_ # same as base above

@Jstein77 Jstein77 changed the title Conversion Metrics [SL-1290] Conversion Metrics Nov 28, 2023
@Jstein77 Jstein77 added High priority Created by Linear-GitHub Sync Metricflow Created by Linear-GitHub Sync Metricflow Gap Created by Linear-GitHub Sync labels Nov 28, 2023
@Jstein77
Copy link
Contributor

Jstein77 commented Dec 7, 2023

Release order:

  1. Core backport to go out. Needs to work with DSI current (0.4) - Week of Dec 18th
  2. Backport DSI change to the current release and roll out a new patch release Done
  3. Once core backport goes out, we roll out conversion metrics in the next release

@Jstein77 Jstein77 removed the High priority Created by Linear-GitHub Sync label Dec 7, 2023
@WilliamDee
Copy link
Contributor

The new finalized spec to be implemented for the initial core of this feature, is now specified here dbt-labs/dbt-semantic-interfaces#210

metric:
  ...
  type: conversion # new type
  type_params:
    conversion_type_params:
      entity: _entity_ # required
      calculation: _calculation_type_ # optional. default: conversion_rate. options: conversions(buys) or conversion_rate (buys/visits) + more to come
      base_measure: _measure_ # required
      conversion_measure: _measure_ # required
      window: _time_window_ # optional. default: inf. window to join the two events on. Follows similar format as time windows elsewhere (i.e. 7 days)
      constant_properties: # optional. List of constant properties default: None
        - base_property: _string_ # required. A reference to a dimension/entity of the semantic model linked to the base_measure
          conversion_property: _string_ # same as base above, but to the semantic model of the conversion_measure

@WilliamDee WilliamDee linked a pull request Dec 11, 2023 that will close this issue
@Jstein77 Jstein77 removed Metricflow Created by Linear-GitHub Sync Metricflow Gap Created by Linear-GitHub Sync labels Feb 8, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
backlog enhancement New feature or request linear multiple_requests Used when multiple users have asked for a feature/enhancement
Projects
None yet
Development

Successfully merging a pull request may close this issue.

6 participants