How can we avoid duplicated code for data steps? #2469
pabloarosado started this conversation in Ideas
-
We could add a new type of dependency on file (included in the manifest for checksum) and have
-
@Marigold noted that archiving things means they don't come up in VSCode search, which is one of the main annoyances of having multiple copies of code around. Can we still distinguish between things that are runnable vs broken?
Summary
Having multiple versions of a dataset is useful.
Having multiple versions of the code is useful when there are differences.
However, having multiple versions of the code is inconvenient when there are no differences.
Moving forward, we have the following options:
StepUpdater and the dashboard.

My current preferences, in order, would be 2.iii (we could discuss alternatives in a data call) > 2.i > 1.ii > 1.i > 2.ii.
Reasoning
Common situation
This is a common scenario in our data workflow:
How often the code needs to change depends on the specific dataset. For datasets that are updated regularly, it's less common for the data provider to change things, and therefore more common to end up with exactly identical versions of data steps.
Is it useful to have multiple versions of the data?
In most cases, when a data provider publishes a new version of a time series, they do not simply add a new set of data points for the new timestamp. They usually change other data points in the past.
Therefore, if something odd is found in the data, it is useful to be able to:
(A) Visualize changes between consecutive versions.
(B) Decouple changes in the code from changes in the data.
We can achieve these things by keeping different versions of the same step. That's a very nice feature of the ETL.
The ETL allows versioned datasets by keeping a self-contained recipe for each data step.
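For instance, benefit (A) is easy to get once two versions of the same step exist side by side. Here is a minimal sketch of comparing two consecutive versions of a table (the file paths and index columns are hypothetical, and this uses plain pandas rather than the ETL's own tooling):

```python
import pandas as pd

# Hypothetical paths to two consecutive versions of the same table.
old = pd.read_csv("garden/namespace/2023-01-01/step.csv", index_col=["country", "year"])
new = pd.read_csv("garden/namespace/2024-01-01/step.csv", index_col=["country", "year"])

# Align the old version on the new row index and show the cells that differ
# (this assumes both versions have the same columns).
diff = new.compare(old.reindex(new.index))
print(diff)
```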
Is it useful to detect "dirty" steps?
The ETL keeps a checksum of each code file, so that, if there is any change in the code of a data step, the ETL knows that the step and everything that depends on it need to be re-run. This is crucial for reproducibility. So:
(C) It is useful to be able to detect changes in the code of data steps.
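As a rough illustration of how this works (a sketch only, not the ETL's actual implementation; the manifest below is hypothetical), a step is "dirty" when the checksum of its code file no longer matches the one recorded when the step was last built:

```python
import hashlib
from pathlib import Path

def file_checksum(path: Path) -> str:
    """Return the MD5 checksum of a file's contents."""
    return hashlib.md5(path.read_bytes()).hexdigest()

# Hypothetical manifest: checksum recorded the last time each step was built.
manifest = {"etl/steps/data/garden/namespace/2024-01-01/step.py": "0d6cae74e9e0ff6e4b8ae72eb6b99ff9"}

for step_file, recorded in manifest.items():
    if file_checksum(Path(step_file)) != recorded:
        # The code changed, so this step and everything that depends on it must be re-run.
        print(f"{step_file} is dirty")
```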
Problems with having repeated code
If we need to do a refactoring (e.g. after an update of pandas), we may need to make the same small change in many identical files. This implies more time for the person doing the refactoring.

Could we simply have versioned snapshots and "latest" data steps?
This is one possible approach for datasets that are updated regularly. It has the advantage that:
But it has downsides:
Could we have modules shared among different versions?
Let's see what happens in the common scenario described above:
This has some problems:
Could we have steps that import previous steps?
This is what I proposed in #2410.
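For illustration, a new step that is identical to its previous version could simply delegate to it, along these lines (a sketch with a hypothetical file layout; the actual mechanism proposed in #2410 may differ):

```python
import importlib.util
from pathlib import Path

# Hypothetical layout: this file lives in .../garden/namespace/2024-01-01/step.py
# and the previous, identical version lives in .../garden/namespace/2023-01-01/step.py.
previous_file = Path(__file__).parents[1] / "2023-01-01" / "step.py"
spec = importlib.util.spec_from_file_location("previous_step", previous_file)
previous_step = importlib.util.module_from_spec(spec)
spec.loader.exec_module(previous_step)

def run(dest_dir: str) -> None:
    # The code has not changed, so re-use the previous version's run() as-is.
    previous_step.run(dest_dir)
```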
This gives us the benefits (A) and (B) (and it's easy to spot if something has changed in the code or if the code is identical to a previous version). However, it's unclear how to achieve (C).
Without changing any further ETL logic, we could achieve (C) too, by adding the imported step as a dependency of the new step in the dag. For example:
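(A hypothetical entry, shown here as the Python dict the dag is parsed into; in practice it lives in the dag YAML file, and the namespace, dates, and step names are placeholders.)

```python
# Hypothetical dag entry: the new garden step lists the previous (identical)
# version as an extra dependency, so any change to it marks the new step dirty.
dag = {
    "data://garden/namespace/2024-01-01/step": [
        "data://meadow/namespace/2024-01-01/step",
        "data://garden/namespace/2023-01-01/step",  # imported previous version
    ],
}
```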
However, this solution creates new problems:
load_dataset("step", channel="meadow")
.snapshot://namespace/2024-01-01/step.csv
is used by the new grapher step, and therefore still has charts. To keep the current logic ofStepUpdater
, we would need to implement some further changes.So, this solution is not very convenient. It's more convenient to allow the new steps to have "hidden dependencies".
I'm sure we can figure out some other way to have (C), but this may require implementing further ETL changes.
What are the downsides of not tracking changes in steps that import previous versions?
If we allowed steps to load arbitrary pieces of code from previous steps, things would get messy quickly. But what I proposed in #2410 is to import a previous step when the new step is identical to the previous one.
Under these conditions, the only risk I can foresee is that, if a user edits an old step, ETL (and datadiff) will not know that the new step should also be re-run.
In practice, editing old steps is uncommon. And, if someone edits an old step, it's usually because of a refactoring or a bug. In both cases, it's actually convenient that the new step imports the previous step. The only downside is that we need to remember to re-run the new step. This could be achieved locally by adding the `--force` flag, or (to ensure the steps run in master) by adding a minor change to the new step (e.g. a comment).

Note that, if any other dependency of the new step (different from the hidden dependency) is modified, the new step will become dirty, because the dag tracks all dependencies of the new step.
As @larsyencken pointed out:
If you remove an old step that is a hidden dependency of a newer step, indeed, you may not notice until you run the new step in a fortnightly build. However, we have previously decided that we would only archive steps if they fail. Therefore, if the old step fails, the new step will fail anyway, and should also be archived.