Fail PRs on links that became broken, but create issues for external links that just started breaking? #134

Closed
untitaker opened this issue Jun 18, 2022 · 6 comments
Labels
enhancement New feature or request

Comments

@untitaker

Is there an obvious, recommended way to report issues for any link that became broken over time, while failing the PR for any newly introduced links that are immediately broken?

I specifically want to block any PRs that introduce broken links (mostly internal, but not exclusively), and at the same time not break the master build if an external website becomes unavailable.

@mre
Member

mre commented Jun 18, 2022

Not that I know of. #17 is at least related. If we introduce an argument for only checking changes, we might as well add one for new files.

@mre
Member

mre commented Jun 18, 2022

Once we have that functionality, you could have two runs: one for new links with fail: true and one for existing ones with fail: false.
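
A rough sketch of what that could look like. Note that the "new links only" selection does not exist yet, so the --only-new-links flag below is purely hypothetical; fail is an existing lychee-action input:

```yaml
jobs:
  links:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      # Blocking run: only links introduced by this PR.
      # --only-new-links is a hypothetical flag standing in for the
      # "check changes only" argument discussed in #17.
      - uses: lycheeverse/lychee-action@v1
        with:
          args: --no-progress --only-new-links .
          fail: true

      # Non-blocking run: everything else, reported but never failing CI.
      - uses: lycheeverse/lychee-action@v1
        with:
          args: --no-progress .
          fail: false
```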

@polarathene

polarathene commented Jun 23, 2022

Scope: PR diff (only links from changes)

Your PR-triggered workflow can provide lychee with STDIN input consisting of the diff lines for files with status A (added) and M (modified), and lychee will scrape that input for links to check. A rough sketch follows below.
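
Something along these lines, assuming the lychee binary itself is available on the runner (the action doesn't read STDIN, so this calls the CLI directly; the base-branch handling is simplified):

```yaml
- name: Check links in lines added by the PR
  run: |
    # Only keep added lines (skipping the '+++' file headers) and let
    # lychee scrape them for links; 'lychee -' reads from STDIN.
    git fetch origin "${{ github.base_ref }}"
    git diff "origin/${{ github.base_ref }}" --unified=0 --diff-filter=AM \
      | grep '^+' | grep -v '^+++' \
      | lychee --no-progress -
```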

It won't work reliably for some link syntax, such as:

  • Markdown docs where the URL is a reference link (e.g. [link text][reference] or [link text](#id-reference)); those would require parsing the full document that contains the link (if lychee supported this link syntax and could ensure there are no typos).
  • Likewise for local/relative links, which lack location context (the input is just lines from STDIN, so no file or location is associated with each scraped link).
  • You also would not catch failures introduced by a PR when the target of a link changes (such as a markdown heading being renamed while an existing, unchanged link still points to it).

For those cases it's probably more reliable to check each affected document in full, by passing the relevant files as inputs to lychee instead of individual lines via STDIN. That way proper context for relative paths exists, along with what I assume is MIME or extension detection for supporting other link types (in my experience, STDIN input is parsed differently than a markdown file, with no way to hint the content type to lychee for STDIN input).
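
A sketch of that variant, again assuming the lychee CLI on the runner; the '*.md' filter is an assumption about where your docs live:

```yaml
- name: Check every file touched by the PR in full
  run: |
    # Pass whole changed files to lychee so relative links keep their
    # context and file-type detection works.
    git fetch origin "${{ github.base_ref }}"
    git diff --name-only --diff-filter=AM "origin/${{ github.base_ref }}" -- '*.md' \
      | xargs --no-run-if-empty lychee --no-progress
```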


Scope: Production branch (scheduled checks)

The other concern: a scheduled workflow can run hourly/daily to check the production branch content. If you have any links to GitHub services that get checked, these appear to be rate-limited, and I think hitting the limit may risk impacting other workflows.

  • Either exclude these links, or use the REST API to query how many requests remain, and skip the run if the remaining quota is nearing the rate limit (see the sketch after this list).
  • Complementary to that, you can leverage lychee's --cache option. With hourly runs, you can pick up where the last run left off:
    • This allows a checked link to be cached for a longer duration, such as 24 hours. If you used multiple hourly runs to work around rate limits on large content, you'd have a separate workflow at the end of the day that reports whatever links were found broken, by opening or appending to/editing an issue on the repo (I help maintain a repo with scheduled workflows for other maintenance tasks, and PR workflows that use actions to add/update comments; that might be a helpful reference).
    • If you choose not to skip the workflow run (I don't think lychee can currently limit requests for a partial check), you could leverage the --dump option to collect links into a file, give lychee N lines per run, and use the CI cache action to persist the remaining links across future runs.
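
A sketch of the scheduled variant with the rate-limit guard and a persisted cache. The remaining-request threshold of 100 is arbitrary, and the cache key rotation is there because GitHub caches are immutable once written:

```yaml
on:
  schedule:
    - cron: '0 * * * *'  # hourly

jobs:
  links:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      # Skip the check when the remaining API quota is low.
      - name: Query rate limit
        id: quota
        env:
          GH_TOKEN: ${{ github.token }}
        run: |
          remaining=$(gh api rate_limit --jq '.resources.core.remaining')
          echo "ok=$([ "$remaining" -gt 100 ] && echo true || echo false)" >> "$GITHUB_OUTPUT"

      # Restore lychee's cache file from previous runs.
      - name: Restore link cache
        if: steps.quota.outputs.ok == 'true'
        uses: actions/cache@v4
        with:
          path: .lycheecache
          key: lychee-${{ github.run_id }}
          restore-keys: lychee-

      - name: Check links
        if: steps.quota.outputs.ok == 'true'
        uses: lycheeverse/lychee-action@v1
        with:
          args: --cache --max-cache-age 24h --no-progress .
          fail: false
```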

Ignoring known broken links from previous checks

In both cases, you could store the broken links detected on your production branch (and optionally reported in an issue by a bot) in a file that these workflows cache and restore/read from, treating its entries as links to ignore/exclude if necessary.

If broken links are rare, though, it may be simpler to maintain the exclusion list yourself.
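
If I remember correctly, lychee automatically picks up a .lycheeignore file from the working directory, with one exclusion pattern (a regular expression) per line, which would work for either the cached or the hand-maintained variant. The URLs below are placeholders:

```
https://example\.com/known-broken-page
^https://flaky-service\.example\.org/
```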

@asmeurer

Is it possible to use the cache feature to require a pre-existing link to be broken for, say, 3 days in a row before reporting it as a failure? That way, if a site just goes down for a little while or has an intermittent 503, it doesn't cause the link checker to fail.

@mre
Member

mre commented Jan 14, 2023

The cache file is there to avoid repeated checks. For example, if the cache is valid for 24h and you run lychee multiple times within that window, it will not re-check a link but immediately return the cached result. In that sense it's the opposite of what you want.

Instead, you could run lychee without a cache and store the output in a file, which you can then cache on GitHub. On the next run you would download that file again and compare its contents with the latest results to determine which links have been broken for longer. This would require some manual scripting work.
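
A minimal sketch of that scripting, assuming lychee's JSON report format, where fail_map holds the failing URLs per input (field names may differ between lychee versions, and the file names here are made up):

```yaml
- name: Report only links that failed in consecutive runs
  run: |
    # Today's failures, one URL per line, sorted for comm(1).
    lychee --format json --output report.json --no-progress . || true
    jq -r '.fail_map[][].url' report.json | sort -u > failing.today

    # Intersect with the previous run's failures (file restored from
    # the GitHub cache; empty on the first run).
    touch failing.previous
    comm -12 failing.today failing.previous > failing.confirmed

    # Persist today's list for the next run (via actions/cache or an
    # artifact), then fail only on confirmed, longer-broken links.
    mv failing.today failing.previous
    [ ! -s failing.confirmed ]
```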

@mre
Member

mre commented Jan 5, 2024

Thank you all for the valuable input on this topic. Having reviewed the discussion, it's clear that the core challenge revolves around managing broken links both in PRs and in the master build. The conversation has touched on several ideas, including:

  • Blocking PRs with newly introduced broken links, as initially proposed by @untitaker.
  • The possibility of using lychee's cache functionality to avoid repeated checks.
  • Suggestions by @polarathene on handling links in PR diffs and scheduled checks for the production branch, which offer a good balance between thoroughness and efficiency.

However, it's important to note that the implementation of some of these suggestions extends beyond the current scope of the lychee-action. Specifically, the idea of requiring a link to be broken for a certain duration before reporting it as a failure, while interesting, involves a level of complexity and manual scripting that is currently out of scope.

As a workaround, I recommend considering the approach of having two separate runs for link checking: one for new links with fail: true and one for existing links with fail: false. This method would effectively address the immediate concern of blocking PRs with newly introduced broken links, while not breaking the master build if a website becomes unavailable.

Focusing on a workaround seems the best course of action.
