[FEATURE] Bot that identifies Build Errors in the CI Logs #13827

Closed
1 task done
lupyuen opened this issue Oct 5, 2024 · 8 comments

Comments

@lupyuen
Member

lupyuen commented Oct 5, 2024

Is your feature request related to a problem? Please describe.

Right now we spend a lot of time scanning the CI Build Logs to figure out why the build failed: Sample Log for arm-05

CMake Error at /github/workspace/sources/nuttx/build/_deps/mynewt-nimble-subbuild/mynewt-nimble-populate-prefix/src/mynewt-nimble-populate-stamp/download-mynewt-nimble-populate.cmake:170 (message):
  Each download failed!
    error: downloading 'https://github.com/apache/mynewt-nimble/archive/fb15c844542e812ceb49ab5ac8502dc93c167b90.tar.gz' failed

What if our PR Bot could scan the CI Build Logs, identify the error and post it as a PR Comment?

Hello! Your CI Build Failed because of this error:
CMake Error at ... error: downloading ...

Describe the solution you'd like

How To Identify Errors

Our Bot will do this with diff, the Conventional Non-AI Way:

  • The above error is from the arm-05 Build Log. Our Bot will search the CI Build Logs, to find the Last Successful Merge Build of arm-05: Last Successful arm-05 Log
  • Our Bot runs a (special kind of) diff, to Compare the Failed Build with the Last Successful Build
  • It will discover the CMake Error, then post it as a PR Comment
  • How is our Bot triggered? It scans the CI Builds every minute, looking for Failed Builds
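
The diff step above could be sketched roughly like this in Python. This is only a minimal sketch, not the Bot's actual implementation: the timestamp format, the keyword list and the function names `normalize` and `new_error_lines` are all assumptions made for illustration, and the "special kind of diff" is approximated by stripping timestamps before comparing.

```python
import difflib
import re

# Assumption: each log line starts with a GitHub Actions style timestamp,
# e.g. "2024-10-05T01:23:45.1234567Z ". Adjust for other log formats.
TIMESTAMP = re.compile(r"^\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}(\.\d+)?Z ")

# Hypothetical keyword list; it would need tuning against real CI logs.
ERROR_HINT = re.compile(r"error|failed|fatal", re.IGNORECASE)


def normalize(log_text):
    """Strip timestamps and trailing whitespace so identical steps compare equal."""
    return [TIMESTAMP.sub("", line).rstrip() for line in log_text.splitlines()]


def new_error_lines(failed_log, good_log):
    """Return lines that appear only in the failed log and look like errors."""
    good = normalize(good_log)
    failed = normalize(failed_log)
    matcher = difflib.SequenceMatcher(a=good, b=failed, autojunk=False)
    added = []
    for tag, _i1, _i2, j1, j2 in matcher.get_opcodes():
        # "insert" and "replace" mark lines present in the failed log only.
        if tag in ("insert", "replace"):
            added.extend(failed[j1:j2])
    return [line for line in added if ERROR_HINT.search(line)]
```

The lines returned by `new_error_lines` are what the Bot would paste into the PR Comment. Comparing against the Last Successful Build filters out warnings and messages that also appear in a passing build.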

Optional: Explain the Error with LLM

This is totally optional: Our Bot could pass the error to an LLM and explain it...

LLM Prompt

What's failing in this NuttX Job for Continuous Integration? Please explain concisely

CMake Error at /github/workspace/sources/nuttx/build/_deps/mynewt-nimble-subbuild/mynewt-nimble-populate-prefix/src/mynewt-nimble-populate-stamp/download-mynewt-nimble-populate.cmake:170 (message):
  Each download failed!
    error: downloading 'https://github.com/apache/mynewt-nimble/archive/fb15c844542e812ceb49ab5ac8502dc93c167b90.tar.gz' failed
          status_code: 28
          status_string: "Timeout was reached"
          log:
          --- LOG BEGIN ---
            Trying 140.82.114.3:443...
  Connection timed out after 30001 milliseconds
  Closing connection 0
          --- LOG END ---
FAILED: mynewt-nimble-populate-prefix/src/mynewt-nimble-populate-stamp/mynewt-nimble-populate-download /github/workspace/sources/nuttx/build/_deps/mynewt-nimble-subbuild/mynewt-nimble-populate-prefix/src/mynewt-nimble-populate-stamp/mynewt-nimble-populate-download

Response from Gemini Pro 1.5

The CI job is failing because it's timing out while trying to download the mynewt-nimble source code from GitHub. This is likely due to a temporary network issue or GitHub server being overloaded.

Concisely: The CI job can't download the required mynewt-nimble source code due to a connection timeout.

Describe alternatives you've considered

No response

Verification

  • I have verified before submitting the report.

@acassis
Contributor

acassis commented Oct 6, 2024

@lupyuen I think the download sometimes fails because of random network instability. Maybe just adding a way to retry could fix the issue.

@lupyuen
Member Author

lupyuen commented Oct 6, 2024

@acassis Yep, great idea! I think we need a privileged account to retry the build. I don't think we should run bots with privileged accounts though 🤔

@lupyuen
Member Author

lupyuen commented Oct 7, 2024

Wonder if this will work: Instead of a Bot, we add a Job to our CI, that will watch for Timeout Errors and retry the Failed Job: https://stackoverflow.com/a/78314483

name: Retry workflow
on:
    workflow_dispatch:
        inputs:
            run_id:
                required: true
jobs:
    rerun:
        runs-on: ubuntu-latest
        steps:
            - name: rerun ${{ inputs.run_id }}
              env:
                  GH_REPO: ${{ github.repository }}
                  GH_TOKEN: ${{ github.token }}
              run: |
                  ## TODO: Check for timeout errors
                  gh run watch ${{ inputs.run_id }} > /dev/null 2>&1
                  gh run rerun ${{ inputs.run_id }} --failed

@simbit18
Contributor

simbit18 commented Oct 7, 2024

Perhaps more simply, this might help:

https://everything.curl.dev/usingcurl/downloads/retry.html

@lupyuen
Member Author

lupyuen commented Oct 8, 2024

Any idea which script is calling curl and failing? We should update them to retry. Every day I need to click and manually re-run a few CI Jobs, this is getting tiring 😬

Update: Wonder if it's because we changed wget to curl: #13641 ? I think wget does Retry with Linear Backoff by default?
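
For reference, Retry with Linear Backoff could look something like the sketch below. It is only an illustration of the idea, not what wget or curl do internally: the helper name `retry_linear` and its parameters are made up here, and it assumes the failing download raises `OSError` (which covers Python's timeout and connection errors).

```python
import time


def retry_linear(action, tries=5, step_seconds=1.0, sleep=time.sleep):
    """Call action() until it succeeds, waiting 1*step, 2*step, ... between tries.

    Hypothetical helper: `action` is any zero-argument callable that raises
    OSError (e.g. a timeout or connection error) when the download fails.
    """
    for attempt in range(1, tries + 1):
        try:
            return action()
        except OSError:
            if attempt == tries:
                raise  # Out of attempts: let the build fail for real
            sleep(attempt * step_seconds)  # Linear backoff: 1s, 2s, 3s, ...
```

A download script could wrap its fetch step in `retry_linear(lambda: fetch(url))`, where `fetch` stands for whatever download call the script already makes. For scripts that call curl directly, curl's own `--retry` option is probably the simpler fix.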

@simbit18
Contributor

simbit18 commented Oct 8, 2024

Hi @lupyuen
Which packages give errors?
On which boards do the errors occur?

PR #13641 only concerns the installation of dependencies and toolchains for Ubuntu and generic Linux, not for GitHub.

@lupyuen
Member Author

lupyuen commented Oct 8, 2024

@simbit18 Here are 2 curl errors from today:
https://github.com/apache/nuttx/actions/runs/11229551377/job/31215370724

Configuration/Tool: icicle/rpmsg-sbi
curl: (28) Failed to connect to github.com port 443 after 136303 ms: Connection timed out
make[1]: *** [opensbi/Make.defs:52: opensbi.tar.gz] Error 28

https://github.com/apache/nuttx/actions/runs/11226642457/job/31210068982

Configuration/Tool: esp32-audio-kit/wifi
error: RPC failed; curl 56 GnuTLS recv error (-54): Error in the pull function.
fatal: protocol error: bad pack header

Update: One more from nuttx-apps, but it looks like a git error: https://github.com/apache/nuttx-apps/actions/runs/11226739564/job/31207828223

Configuration/Tool: esp32c3-generic/rmt
fatal: unable to access 'https://github.com/espressif/esp-hal-3rdparty.git/': Failed to connect to github.com port 443 after 133496 ms: Connection timed out

Another one from my repo (is it caused by curl?): https://github.com/lupyuen5/label-nuttx-apps/actions/runs/11230244326/job/31217304232

Configuration/Tool: waveshare-rp2040-lcd-1.28/lvgl,CONFIG_ARM_TOOLCHAIN_GNU_EABI
[v9.1.0.zip]
  End-of-central-directory signature not found.  Either this file is not
  a zipfile, or it constitutes one disk of a multi-part archive.  In the
  latter case the central directory and zipfile comment will be found on
  the last disk(s) of this archive.
unzip:  cannot find zipfile directory in one of v9.1.0.zip or
        v9.1.0.zip.zip, and cannot find v9.1.0.zip.ZIP, period.
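
The "TODO: Check for timeout errors" step in the Retry workflow could match the log against the messages above. Here is a rough Python sketch. The pattern list and the name `is_transient_failure` are assumptions for illustration, and treating the unzip error as transient assumes that the zip was truncated by a failed download, which is not confirmed.

```python
import re

# Hypothetical patterns, copied from the error messages quoted in this thread.
TRANSIENT_PATTERNS = [
    re.compile(r"curl: \(28\)"),            # curl exit code 28: timeout
    re.compile(r"Connection timed out"),
    re.compile(r"Timeout was reached"),
    re.compile(r"RPC failed; curl \d+"),    # git fetch failing over HTTPS
    # Assumption: a broken zip usually means a truncated download.
    re.compile(r"End-of-central-directory signature not found"),
]


def is_transient_failure(log_text):
    """Return True if the log contains a known network-style error."""
    return any(pattern.search(log_text) for pattern in TRANSIENT_PATTERNS)
```

Only jobs where `is_transient_failure` returns True would be re-run, so that genuine compile errors are still reported right away instead of being retried.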

@simbit18
Contributor

simbit18 commented Oct 8, 2024

This https://github.com/apache/nuttx/actions/runs/11229551377/job/31215370724 is for opensbi.tar.gz
https://github.com/apache/nuttx/blob/master/arch/risc-v/src/opensbi/Make.defs

On nuttx-apps, adding the -S option will avoid problems with esp-hal-3rdparty.git (PR #13301)

I think we need to apply the same retry logic to the other packages that use curl

@lupyuen lupyuen closed this as completed Oct 11, 2024