Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix loki source file windows #2282

Closed
wants to merge 6 commits into from
Closed

Conversation

wildum
Copy link
Contributor

@wildum wildum commented Dec 13, 2024

PR Description

On Windows, when a file that loki.source.file was tailing is deleted, the tailer will be stopped. On Unix-like systems, the tailer will stay alive and keep retrying to open the file.

This is ok in most cases, but in the specific scenario where you have a loki.source.file receiving targets from a local.file_match component and that you rotate the file by deleting/re-creating it with the same name and within the same the sync_period, the local.file_match won't send an update because the targets are still the same. As a result, on Windows the loki.source.file will not tail the file anymore. (On Unix-like system, it would recover because it would manage to re-open the file).

This PR adds an optional argument retry_interval (default 0) to restart the readers on a tick when they stop. By default, there is no tick so that existing implementations are not affected.

Which issue(s) this PR fixes

Notes to the Reviewer

PR Checklist

  • CHANGELOG.md updated
  • Documentation added
  • Tests updated
  • Config converters updated

Copy link
Contributor

github-actions bot commented Dec 13, 2024

💻 Deploy preview deleted.

@wildum wildum marked this pull request as ready for review December 13, 2024 15:00
@wildum wildum requested review from clayton-cornell and a team as code owners December 13, 2024 15:00
Copy link
Collaborator

@mattdurham mattdurham left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@@ -143,6 +144,24 @@ func (c *Component) Run(ctx context.Context) error {
c.mut.RUnlock()
}()

tickerChan := make(chan struct{})
if c.args.RetryInterval > 0 {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This doesnt handle updating the RetryInterval after running, can we add a test for that?

Copy link
Collaborator

@mattdurham mattdurham left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Needs to handle updating


The `encoding` argument must be a valid [IANA encoding][] name. If not set, it
defaults to UTF-8.

You can use the `tail_from_end` argument when you want to tail a large file without reading its entire content.
When set to true, only new logs will be read, ignoring the existing ones.

`retry_interval` is deactivated by default (`"0s"`). This should be set on Windows-based systems when rotating files with the same names because the
component will no try to re-open the files and the targets won't be updated because of the cache.
This is not needed on Unix-like systems because the component will always try to re-open the deleted files.
Copy link
Contributor

@clayton-cornell clayton-cornell Dec 13, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Drafting up a new suggestion.

We can state the specific OSes we support.

This sentence (and the overall description) isn't really clear to me. Why will loki.source_file try to reopen deleted files? The description in the table says the arg will tell the component to try to reopen closed files. Which is it? I assume it's closed files since trying to reopen deleted files doesn't really make sense. Why won't the component try to reopen closed files on Windows? It's also not really clear to me. Windows caches the files? And the other OSes do not?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We have a case where a user is rotating a "app.log" file everytime it reaches 5MB by deleting it and creating a new app.log file.

On Windows you must first close the file before deleting it. Closing the file will stop the tailer in the loki.source.file component. If the new file would have a different name, the local.file_match would send it to the loki.source.file component and it would work fine. But for the local.file_match component nothing changed because it only checks at a regular interval and it will just see that there is still an "app.log" file. Because of this, it won't notify the loki.source.file component to start a new tailer and the file won't be tailed (it only notifies the next component when the set of files changed).

On Unix-like systems, you can delete the file without closing it. Because the file is not closed, the tailer is not stopped, it continuously tries to read the file until the local.file_match sends an update. In the case of the user, it will restart the tailer once the new "app.log" file is created because of the names it thinks that it's the same file.

@wildum
Copy link
Contributor Author

wildum commented Dec 16, 2024

I made it updatable but the tests are flaky and I'm not sure why... Also I'm not super happy with the solution, adding an argument for an edge case on Windows does not feel great. I'm tempted to just put a Note in the doc for now and maybe we could plan a refactor of this component to use a pool of worker as you suggested and at the same time see whether we can tackle this edge case in a different way @mattdurham ?

@mattdurham
Copy link
Collaborator

Not in love with it either, but it does fix an underlying problem. Would also be pro pulling back a bit and trying to figure out if there is a cleaner way.

@dehaansa
Copy link
Contributor

Could we solve this better by updating the file_match component to add a created_at timestamp to the discovered target map? This would allow consumers of the target like loki source file to identify new files.

Looks like it's possible to get, though we'll need to add os specific implementations to the file_match component (https://stackoverflow.com/questions/59413768/how-to-find-file-creation-date-in-windows-using-go)

@wildum
Copy link
Contributor Author

wildum commented Jan 2, 2025

Could we solve this better by updating the file_match component to add a created_at timestamp to the discovered target map? This would allow consumers of the target like loki source file to identify new files.

That's an interesting alternative but this makes the component dependent on this label. It would not work if you hardcode the target in the config directly. I think that the problem should be solved in the loki.source.file component directly

Because this problem is not urgent (the easy workaround is to change the name of the file you rotate), I would rather consider planning a refactor of the component to address scaling issues and set up a retry mechanism

@wildum wildum closed this Jan 2, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants