Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Change ordering of Matrixed TaskRuns Produced to Not Be Deterministic #7056

Closed
Tracked by #5265
EmmaMunley opened this issue Aug 21, 2023 · 15 comments
Closed
Tracked by #5265
Labels
kind/feature Categorizes issue or PR as related to a new feature.

Comments

@EmmaMunley
Copy link
Contributor

This issue is tracking the discussion regarding the fact that TaskRuns are currently fanned out from a Matrixed PipelineTask in a deterministic order with a suggestion of wanting to change it to be non-deterministic so that users won't rely on the ordering. This was originally brought up when discussing TEP-140: Produce Results from Matrixed PipelineTasks: tektoncd/community#1047 (comment)

Screenshot 2023-08-21 at 11 06 11 AM

This issue should be resolved before promoting Matrix to Beta #5265

@EmmaMunley EmmaMunley added the kind/feature Categorizes issue or PR as related to a new feature. label Aug 21, 2023
@EmmaMunley
Copy link
Contributor Author

@lbernick @pritidesai @jerop

@lbernick
Copy link
Member

Original discussion where this was introduced: #6312 (comment)
Note that Golang randomizes iteration order over map keys for a similar reasons: https://go.dev/doc/go1#iteration

@pritidesai
Copy link
Member

Yes absolutely, ordering over map is randomized but not an array. Fanning out feature is equivalent to an array or a list. This is not a beta blocker.

@lbernick Please explain the use cases for your concerns. Thanks!

@lbernick
Copy link
Member

I'd like to have the possibility of changing our matrix implementation in the future. If users expect that taskruns are fanned out in a deterministic order, they will likely come to rely on this behavior, making it much harder for us to change it if we want. We can always go from randomized ordering to deterministic ordering later, but we cannot go from deterministic order to randomized order if users depend on deterministic order. Therefore, I believe we should start with randomized order, and only introduce deterministic order if there's a strong use case/need for it.

In terms of use cases: There's no use case I'm aware of for pipeline authors needing randomized ordering. The use case is for Tekton developers wanting to change implementation of fanned out tasks. I don't know what we might need to change, but here's a straw man example:

Consider a matrixed taskrun that produces an array result, which is then fanned out into another matrixed taskrun. We will probably initially implement this to wait until all matrixed taskruns are complete, and then fan out the second task in the same order as the first task. But what if we want to improve performance, and start creating matrixed taskruns for the second task as soon as the string results from some of the previous matrixed taskruns are ready? If we guarantee deterministic ordering, we might not be able to make this change.

While deterministic ordering makes testing easier, I'm not aware of any use cases for it. If we do want to support deterministic ordering as a feature, I think we should have clear use cases in mind, and make sure deterministic ordering is the best way to address them.

I believe this is a beta blocker, since alpha features "may be dropped or backwards incompatible changes made at any time", while for beta features, "these features will not be dropped, though the details may change." If deterministic ordering is considered a feature, rather than an implementation detail, promoting matrix to beta could signify that we will not remove deterministic ordering. Therefore, I think we should make a decision on this before beta promotion.

@pritidesai
Copy link
Member

The use case is for Tekton developers wanting to change implementation of fanned out tasks.

What could be the motivation for pipeline developers wanting to change implementation of fanned out tasks?

Consider a matrixed taskrun that produces an array result, which is then fanned out into another matrixed taskrun. We will probably initially implement this to wait until all matrixed taskruns are complete, and then fan out the second task in the same order as the first task. But what if we want to improve performance, and start creating matrixed taskruns for the second task as soon as the string results from some of the previous matrixed taskruns are ready? If we guarantee deterministic ordering, we might not be able to make this change.

The use case you have included is not clear to me. Let's say we have two tasks in a pipeline.

    - name: kaniko-build
      taskRef:
        name: kaniko
      matrix:
        params:
          - name: IMAGE
            value: [ "image-1", "image-2", "image-3" ]
    - name: task-deploy-images
      taskRef:
        name: deploy
        matrix:
          params:
            - name: DIGEST
              value: $(tasks.kaniko-build.results.IMAGE-DIGEST[*]) 

Now, the use case you are describing includes scheduling an instance of task-deploy-images as soon as an instance with running an image-1 is done. Is that correct?

While deterministic ordering makes testing easier, I'm not aware of any use cases for it.

what kind testing are you referring to?

Let me take some time to understand your concerns with matrix.

Let's continue with the same example.

    - name: kaniko-build
      taskRef:
        name: kaniko
      matrix:
        params:
          - name: IMAGE
            value: [ "image-1", "image-2", "image-3" ]

Now, the param IMAGE is an array. When you index into IMAGE, would the first element be always image-1 or it changes everytime you access IMAGE[0]?

By following the standard programming language constructs, IMAGE[0] will always return image-1. So the first instance of the taskRun will return the DIGEST for the first element in an array.

Now, making it a little bit more complex, let's say:

    - name: kaniko-build
      taskRef:
        name: kaniko
      matrix:
        params:
          - name: IMAGE
            value: [ "image-1", "image-2", "image-3" ]
          - name: OS
            value: ["linux", "windows"]

matrix is a list of params i.e it is an array of objects. Is the map of IMAGE always the first element of matrix?

Ranging over an array matrix returns IMAGE as the first element of matrix in FanOut and OS as the second element of an array.

Now, the first instance of this matrix is executed with image-1 and linux.

Change ordering of Matrixed TaskRuns Produced to Not Be Deterministic

Are you suggesting to randomize ranging over an array using rand?

@lbernick
Copy link
Member

What could be the motivation for pipeline developers wanting to change implementation of fanned out tasks?

It's hard for me to predict what changes we want to make in the future, but I think it's important to leave ourselves the option of doing so. We can always re-introduce sorting later, but it's hard to take away. The example I gave was just an attempt to brainstorm a reason we might want to change the implementation. I don't think we need to know that we will change the implementation; I think we just shouldn't lock ourselves into this implementation without a good reason.

While deterministic ordering makes testing easier, I'm not aware of any use cases for it.

what kind testing are you referring to?

Just tekton pipelines unit tests. However, I don't think easier testing (by itself) is a good reason to have deterministic ordering.

Let me take some time to understand your concerns with matrix.

Let's continue with the same example.

    - name: kaniko-build
      taskRef:
        name: kaniko
      matrix:
        params:
          - name: IMAGE
            value: [ "image-1", "image-2", "image-3" ]

Now, the param IMAGE is an array. When you index into IMAGE, would the first element be always image-1 or it changes everytime you access IMAGE[0]?

By following the standard programming language constructs, IMAGE[0] will always return image-1. So the first instance of the taskRun will return the DIGEST for the first element in an array.

Now, making it a little bit more complex, let's say:

    - name: kaniko-build
      taskRef:
        name: kaniko
      matrix:
        params:
          - name: IMAGE
            value: [ "image-1", "image-2", "image-3" ]
          - name: OS
            value: ["linux", "windows"]

matrix is a list of params i.e it is an array of objects. Is the map of IMAGE always the first element of matrix?

Ranging over an array matrix returns IMAGE as the first element of matrix in FanOut and OS as the second element of an array.

Now, the first instance of this matrix is executed with image-1 and linux.

I agree that typical expectations are that array ordering is preserved. However, I don't think there are typical expectations when it comes to iteration over 2D (or 3D etc) arrays (e.g. should the next combo be "image-2" and "linux" or "image-1" and "windows"?), and it's not clear what the ordering should be if there's an include statement as well.

Are you suggesting to randomize ranging over an array using rand?

No, I'm just suggesting we remove our sorting logic and document that matrix ordering is not guaranteed.

@lbernick lbernick mentioned this issue Jun 23, 2023
7 tasks
@pritidesai
Copy link
Member

I agree that typical expectations are that array ordering is preserved. However, I don't think there are typical expectations when it comes to iteration over 2D (or 3D etc) arrays (e.g. should the next combo be "image-2" and "linux" or "image-1" and "windows"?), and it's not clear what the ordering should be if there's an include statement as well.

We are not reinventing here, this is a standard programming paradigm. The entire language specifications for include is described in the proposal. Please refer to the TEP to understand how the combinations are derived.

Just tekton pipelines unit tests. However, I don't think easier testing (by itself) is a good reason to have deterministic ordering.

No, I'm just suggesting we remove our sorting logic and document that matrix ordering is not guaranteed.

This sorting logic has no impact or influence on the matrix ordering. This sorting logic was introduced to order each combination object e.g. a combination with multiple parameters returns the same order of params such that comparing combinations does not cause any failures due to mismatch order.

Let's take an existing example with the latest pipelines release:

Here is the list of taskRuns created:

🗂  Taskruns

 NAME                                           TASK NAME                STARTED          DURATION   STATUS
 ∙ matrixed-pr-scd79-matrix-and-params-2        matrix-and-params        21 seconds ago   20s        Succeeded
 ∙ matrixed-pr-scd79-matrix-and-params-1        matrix-and-params        21 seconds ago   19s        Succeeded
 ∙ matrixed-pr-scd79-platforms-and-browsers-8   platforms-and-browsers   22 seconds ago   20s        Succeeded
 ∙ matrixed-pr-scd79-matrix-and-params-0        matrix-and-params        22 seconds ago   21s        Succeeded
 ∙ matrixed-pr-scd79-platforms-and-browsers-6   platforms-and-browsers   23 seconds ago   15s        Succeeded
 ∙ matrixed-pr-scd79-platforms-and-browsers-7   platforms-and-browsers   23 seconds ago   21s        Succeeded
 ∙ matrixed-pr-scd79-platforms-and-browsers-4   platforms-and-browsers   24 seconds ago   12s        Succeeded
 ∙ matrixed-pr-scd79-platforms-and-browsers-5   platforms-and-browsers   24 seconds ago   15s        Succeeded
 ∙ matrixed-pr-scd79-platforms-and-browsers-3   platforms-and-browsers   24 seconds ago   13s        Succeeded
 ∙ matrixed-pr-scd79-platforms-and-browsers-1   platforms-and-browsers   25 seconds ago   17s        Succeeded
 ∙ matrixed-pr-scd79-platforms-and-browsers-0   platforms-and-browsers   25 seconds ago   16s        Succeeded
 ∙ matrixed-pr-scd79-platforms-and-browsers-2   platforms-and-browsers   25 seconds ago   19s        Succeeded

Now, let's see what were the combination used for each instance:

k get tr matrixed-pr-scd79-platforms-and-browsers-0 -o json | jq .spec.params 
[
  {
    "name": "browser",
    "value": "chrome"
  },
  {
    "name": "platform",
    "value": "linux"
  }
]

k get tr matrixed-pr-scd79-platforms-and-browsers-1 -o json | jq .spec.params
[
  {
    "name": "browser",
    "value": "chrome"
  },
  {
    "name": "platform",
    "value": "mac"
  }
]

k get tr matrixed-pr-scd79-platforms-and-browsers-2 -o json | jq .spec.params
[
  {
    "name": "browser",
    "value": "chrome"
  },
  {
    "name": "platform",
    "value": "windows"
  }
]

k get tr matrixed-pr-scd79-platforms-and-browsers-3 -o json | jq .spec.params           
[
  {
    "name": "browser",
    "value": "safari"
  },
  {
    "name": "platform",
    "value": "linux"
  }
]

k get tr matrixed-pr-scd79-platforms-and-browsers-4 -o json | jq .spec.params
[
  {
    "name": "browser",
    "value": "safari"
  },
  {
    "name": "platform",
    "value": "mac"
  }
]

k get tr matrixed-pr-scd79-platforms-and-browsers-5 -o json | jq .spec.params
[
  {
    "name": "browser",
    "value": "safari"
  },
  {
    "name": "platform",
    "value": "windows"
  }
]

Note the combinations, first three taskRuns 0 through 2 has browser set to chrome with all three platforms [ linux, mac, and windows ] just like when you iterate over an array.

The next three taskRuns 3 through 5 has browser set to safari with all three platforms in order [ linux, mac, and windows ].

Now, I deleted the sorting logic and ran the same example:

k get tr matrixed-pr-mnjbl-platforms-and-browsers-0 -o json | jq .spec.params
[
  {
    "name": "browser",
    "value": "chrome"
  },
  {
    "name": "platform",
    "value": "linux"
  }
]

k get tr matrixed-pr-mnjbl-platforms-and-browsers-1 -o json | jq .spec.params
[
  {
    "name": "browser",
    "value": "chrome"
  },
  {
    "name": "platform",
    "value": "mac"
  }
]

k get tr matrixed-pr-mnjbl-platforms-and-browsers-2 -o json | jq .spec.params
[
  {
    "name": "browser",
    "value": "chrome"
  },
  {
    "name": "platform",
    "value": "windows"
  }
]

k get tr matrixed-pr-mnjbl-platforms-and-browsers-3 -o json | jq .spec.params
[
  {
    "name": "browser",
    "value": "safari"
  },
  {
    "name": "platform",
    "value": "linux"
  }
]

k get tr matrixed-pr-mnjbl-platforms-and-browsers-4 -o json | jq .spec.params
[
  {
    "name": "browser",
    "value": "safari"
  },
  {
    "name": "platform",
    "value": "mac"
  }
]

k get tr matrixed-pr-mnjbl-platforms-and-browsers-5 -o json | jq .spec.params
[
  {
    "name": "browser",
    "value": "safari"
  },
  {
    "name": "platform",
    "value": "windows"
  }
]

The order is still the same.

@jerop
Copy link
Member

jerop commented Aug 22, 2023

If it helps, Matrix in GitHub Actions inspired Matrix in Tekton. The ordering is also deterministic in GitHub Actions, and it is ordered in the same way as Tekton. Users of Tekton who are familiar with GitHub Actions would likely expect the feature to behave in similar ways.

The blurb below the line below is lifted from this documentation: https://docs.github.com/en/actions/using-jobs/using-a-matrix-for-your-jobs#using-a-matrix-strategy


jobs:
  example_matrix:
    strategy:
      matrix:
        version: [10, 12, 14]
        os: [ubuntu-latest, windows-latest]

The order of the variables in the matrix determines the order in which the jobs are created. The first variable you define will be the first job that is created in your workflow run. For example, the above matrix will create the jobs in the following order:

{version: 10, os: ubuntu-latest}
{version: 10, os: windows-latest}
{version: 12, os: ubuntu-latest}
{version: 12, os: windows-latest}
{version: 14, os: ubuntu-latest}
{version: 14, os: windows-latest}

@pritidesai
Copy link
Member

thanks @jerop yes it is an industry standard and common across many CI/CD platforms.

@lbernick
Copy link
Member

Now, I deleted the sorting logic and ran the same example:

The order is still the same.

Thank you for testing this out; I didn't realize this sorting logic was not what was causing the deterministic ordering.

@jerop thanks for the info about GHA; do you happen to have any insight on why they chose to support this? (I understand this might not be publicly available info.)

I am not opposed to deterministic ordering, but I want to understand the use cases that exist for it before users come to rely on it.

@pritidesai
Copy link
Member

I am not opposed to deterministic ordering

Thanks a bunch for clarifying this. May be I misread this issue suggesting to change the deterministic order. Now that since you not opposed to it, can we please move this out as a blocker from #5265 (comment). Thanks!

but I want to understand the use cases that exist for it before users come to rely on it.

Please clarify what exactly you would like to understand, I am happy to help!

@lbernick
Copy link
Member

I still believe this is important to decide before beta, for the reasons I listed in this comment. Would a clearer title for this issue be "decide whether to preserve deterministic ordering of matrixed taskruns"?

I see a few options:

  • If there's a clear use case for deterministic ordering, move forward with that
  • If we don't think there's a clear use case, make ordering randomized, since we can always revert to deterministic later on
  • Agree to disagree, and make a decision on the way forward via a method other than consensus

I'd like to understand an example of a CI/CD workflow that would be much more difficult to implement if matrixed ordering was not deterministic.

@jerop
Copy link
Member

jerop commented Aug 23, 2023

@jerop thanks for the info about GHA; do you happen to have any insight on why they chose to support this? (I understand this might not be publicly available info.)

I cannot speak to why GitHub Actions chose to support it, but I can see the predictability of the matrix being useful for users

Users can rely on the deterministic ordering to automate progressive deployments to different environments by using a matrix and limiting concurrency to one: https://cbrgm.net/post/2022-01-12-github-actions-environments/

...
jobs:
  deploy:
    runs-on: ubuntu-latest
    strategy:
      matrix:
        stage: ['development', 'integration', 'production']
      fail-fast: true
      max-parallel: 1
    steps:
      - name: do some stuff
        uses: ...
        with: ...

In TEP-0090, we indicated interest in adding similar concurrency controls in Matrix in Tekton

"When a Matrix in PipelineTask would generate more than the maximum TaskRuns or Runs, this would fail the Pipeline in the first iteration. After initial usage of Matrix, we can explore other ways of supporting usage beyond that limit, such as allowing TaskRuns or Runs only up to the limit to run at a time, in a follow-up TEP."

While looking into the deterministic ordering in GHA matrix, I ran into reported issues with comments from frustrated users when the ordering was broken: actions/runner#703, actions/runner#483. The examples mentioned in the issues are depending on matrix and concurrency to automate progressive processes.

# running two tests at a time, then progress to more expensive tests only when the previous were successful
...
jobs:
  tests:
    runs-on: ubuntu-latest
    strategy:
      matrix:
        test: ['simple', 'groovy', 'filesystem', 'webflow', 'webflowconfig', 'attributes']
      fail-fast: true
      max-parallel: 2
    steps:
      - name: run the tests
        uses: ...
        with: ...
# changing terraform workspace one at a time, starting with dev then staging
...
jobs:
  deploy:
    runs-on: ubuntu-latest
    strategy:
      matrix:
        workspace: ['dev-us-east-1', 'stg-us-west-2']
      fail-fast: true
      max-parallel: 1
    steps:
      - name: change terraform workspace
        uses: ...
        with: ...

@lbernick
Copy link
Member

Thanks very much for this info @jerop, this is exactly what I was looking for, super helpful!

I agree this is an important use case to address, although I am not sure that deterministic ordering is the best way to address it. However, I agree that users who are used to GHA syntax may be surprised to find that ours differs. I am curious to hear what other maintainers think.

Since it sounds like IBM has a strong need for this feature, and I don't have a strong stake in it, I am not opposed to moving forward.

@pritidesai
Copy link
Member

Thanks a bunch @EmmaMunley for your continuous support on making this feature possible!

In addition to IBM, every small and big organization which has deployed Tekton for their CI/CD needs will be benefited with this feature.

@lbernick please feel free to reopen this whenever you find some stake in it.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
kind/feature Categorizes issue or PR as related to a new feature.
Projects
None yet
Development

No branches or pull requests

4 participants