How can we avoid duplicated code for data steps? #2469
pabloarosado started this conversation in Ideas
-
We could add a new type of dependency on file (included in the manifest for checksum) and have
-
@Marigold noted that archiving things means they don't come up in VSCode search, which is one of the main annoyances of having multiple copies of code around. Can we still distinguish between things that are runnable vs broken?
Summary
Having multiple versions of a dataset is useful.
Having multiple versions of the code is useful when there are differences.
However, having multiple versions of the code is inconvenient when there are no differences.
Moving forward, we have the following options:
StepUpdater and the dashboard.

My current preferences, in order, would be 2.iii (we could discuss alternatives in a data call) > 2.i > 1.ii > 1.i > 2.ii.
Reasoning
Common situation
This is a common scenario in our data workflow:
How often the code needs to change depends on the specific dataset. For datasets that are updated regularly, it's less common for the data provider to change things, and therefore more common to end up with exactly identical versions of data steps.
Is it useful to have multiple versions of the data?
In most cases, when a data provider publishes a new version of a time series, they do not simply add a new set of data points for the new timestamp. They usually change other data points in the past.
Therefore, if something odd is found in the data, it is useful to be able to:
(A) Visualize changes between consecutive versions.
(B) Decouple changes in the code from changes in the data.
We can achieve these things by keeping different versions of the same step. That's a very nice feature of the ETL.
The ETL allows versioned datasets by keeping a self-contained recipe for each data step.
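For instance, benefit (A) is easy to get once two versions of the same step exist side by side. Here is a minimal sketch of comparing two consecutive versions of a table (the file paths and index columns are hypothetical, and this uses plain pandas rather than the ETL's own tooling):

```python
import pandas as pd

# Hypothetical paths to two consecutive versions of the same table.
old = pd.read_csv("garden/namespace/2023-01-01/step.csv", index_col=["country", "year"])
new = pd.read_csv("garden/namespace/2024-01-01/step.csv", index_col=["country", "year"])

# Align the old version on the new row index and show the cells that differ
# (this assumes both versions have the same columns).
diff = new.compare(old.reindex(new.index))
print(diff)
```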
Is it useful to detect "dirty" steps?
The ETL keeps a checksum of each code file, so that, if there is any change in the code of a data step, the ETL knows that the step and everything that depends on it need to be re-run. This is crucial for reproducibility. So:
(C) It is useful to be able to detect changes in the code of data steps.
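As a rough illustration of how this works (a sketch only, not the ETL's actual implementation; the manifest below is hypothetical), a step is "dirty" when the checksum of its code file no longer matches the one recorded when the step was last built:

```python
import hashlib
from pathlib import Path

def file_checksum(path: Path) -> str:
    """Return the MD5 checksum of a file's contents."""
    return hashlib.md5(path.read_bytes()).hexdigest()

# Hypothetical manifest: checksum recorded the last time each step was built.
manifest = {"etl/steps/data/garden/namespace/2024-01-01/step.py": "0d6cae74e9e0ff6e4b8ae72eb6b99ff9"}

for step_file, recorded in manifest.items():
    if file_checksum(Path(step_file)) != recorded:
        # The code changed, so this step and everything that depends on it must be re-run.
        print(f"{step_file} is dirty")
```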
Problems with having repeated code
If we need to do a refactoring (e.g. after an update of pandas), we may need to make the same small change in many identical files. This implies more time for the person doing the refactoring.

Could we simply have versioned snapshots and "latest" data steps?
This is one possible approach for datasets that are updated regularly. It has the advantage that:
But it has downsides:
Could we have modules shared among different versions?
Let's see what happens in the common scenario described above:
This has some problems:
Could we have steps that import previous steps?
This is what I proposed in #2410.
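For illustration, a new step that is identical to its previous version could simply delegate to it, along these lines (a sketch with a hypothetical file layout; the actual mechanism proposed in #2410 may differ):

```python
import importlib.util
from pathlib import Path

# Hypothetical layout: this file lives in .../garden/namespace/2024-01-01/step.py
# and the previous, identical version lives in .../garden/namespace/2023-01-01/step.py.
previous_file = Path(__file__).parents[1] / "2023-01-01" / "step.py"
spec = importlib.util.spec_from_file_location("previous_step", previous_file)
previous_step = importlib.util.module_from_spec(spec)
spec.loader.exec_module(previous_step)

def run(dest_dir: str) -> None:
    # The code has not changed, so re-use the previous version's run() as-is.
    previous_step.run(dest_dir)
```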
This gives us the benefits (A) and (B) (and it's easy to spot if something has changed in the code or if the code is identical to a previous version). However, it's unclear how to achieve (C).
Without changing any further ETL logic, we could achieve (C) too, by adding the imported step as a dependency of the new step in the dag. For example:
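(A hypothetical entry, shown here as the Python dict the dag is parsed into; in practice it lives in the dag YAML file, and the namespace, dates, and step names are placeholders.)

```python
# Hypothetical dag entry: the new garden step lists the previous (identical)
# version as an extra dependency, so any change to it marks the new step dirty.
dag = {
    "data://garden/namespace/2024-01-01/step": [
        "data://meadow/namespace/2024-01-01/step",
        "data://garden/namespace/2023-01-01/step",  # imported previous version
    ],
}
```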
However, this solution creates new problems:
load_dataset("step", channel="meadow")
.snapshot://namespace/2024-01-01/step.csv
is used by the new grapher step, and therefore still has charts. To keep the current logic ofStepUpdater
, we would need to implement some further changes.So, this solution is not very convenient. It's more convenient to allow the new steps to have "hidden dependencies".
I'm sure we can figure out some other way to have (C), but this may require implementing further ETL changes.
What are the downsides of not tracking changes in steps that import previous versions?
If we allowed steps to load arbitrary pieces of code from previous steps, things would get messy quickly. But what I proposed in #2410 is to import a previous step when the new step is identical to the previous one.
Under these conditions, the only risk I can foresee is that, if a user edits an old step, ETL (and datadiff) will not know that the new step should also be re-run.
In practice, editing old steps is uncommon. And, if someone edits an old step, it's usually because of a refactoring or a bug. In both cases, it's actually convenient that the new step imports the previous step. The only downside is that we need to remember to re-run the new step. This could be achieved locally by adding the `--force` flag, or (to ensure the steps run in master) by adding a minor change to the new step (e.g. a comment).

Note that, if any other dependency of the new step (different from the hidden dependency) is modified, the new step will become dirty, because the dag tracks all dependencies of the new step.
As @larsyencken pointed out:
If you remove an old step that is a hidden dependency of a newer step, indeed, you may not notice until you run the new step in a fortnightly build. However, we have previously decided that we would only archive steps if they fail. Therefore, if the old step fails, the new step will fail anyway, and should also be archived.