
Issue Job Enqueued count incorrect/mismatch #377

Open · yadurajshakti opened this issue Oct 15, 2024 · 7 comments

Comments

@yadurajshakti

Hi Team,

We are facing an issue where the Enqueued count is incorrect/mismatched, as you can see in the screenshot below. We could not find any solution on the Hangfire forum or on GitHub yet. Could you please explain this behavior and suggest a possible fix?

[screenshot]

We are using .NET 7.0 and PostgreSQL. The Hangfire packages and versions used in the solution are:
"Hangfire.AspNetCore" Version="1.8.0"
"Hangfire.Core" Version="1.8.0"
"Hangfire.PostgreSql" Version="1.19.12"

One observation: the job is moved off the queue, but its state is not changed to either Succeeded or Failed.
[screenshot]

@azygis (Collaborator) commented Oct 15, 2024

It would be good to provide at least the configuration/setup done for Hangfire. The best case would be a minimal reproducible example.

@yadurajshakti (Author) commented Oct 21, 2024

Hi @azygis

We have done a setup similar to the one on the official website: https://docs.hangfire.io/en/latest/getting-started/aspnet-core-applications.html

The dashboard is a separate web application, and we have a console app to schedule the job.

[screenshot]

Dashboard Configuration:

public void ConfigureServices(IServiceCollection services)
{
    // Add Hangfire services.
    services.AddHangfire(configuration => configuration
        .SetDataCompatibilityLevel(CompatibilityLevel.Version_170)
        .UseSimpleAssemblyNameTypeSerializer()
        .UseRecommendedSerializerSettings()
        .UseSqlServerStorage(Configuration.GetConnectionString("HangfireConnection")));  
}

Then, in the app configuration:

public void Configure(IApplicationBuilder app, IWebHostEnvironment env)
{
    app.UseHangfireDashboard(
        "/hangfire",
        new DashboardOptions
        {
            Authorization = new[] { new AuthorizationFilter() },
            AppPath = "/hangfire/home/index"
        },
        new PostgreSqlStorage(Configuration.GetConnectionString("HangfireConnection"))
    );
}

Engine configuration responsible for scheduling and executing the jobs:

private void ConfigureHangFire(HostBuilderContext hostContext, IServiceCollection services)
{
    GlobalConfiguration.Configuration.UsePostgreSqlStorage(connectionString, new PostgreSqlStorageOptions
    {
        DistributedLockTimeout = TimeSpan.FromMinutes(5),
        InvisibilityTimeout = TimeSpan.FromMinutes(20)
    });
    GlobalConfiguration.Configuration.UseSimpleAssemblyNameTypeSerializer();
    GlobalConfiguration.Configuration.UseRecommendedSerializerSettings();
    GlobalJobFilters.Filters.Add(new AutomaticRetryAttribute { Attempts = 3 });

    var options = new BackgroundJobServerOptions
    {
        WorkerCount = 1
    };
}
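
The snippet above builds BackgroundJobServerOptions but does not show where the processing server itself is started. For context, a minimal sketch of how such a console engine is typically wired up on the .NET generic host; the hosting code below is an illustration under that assumption, not the reporter's actual code, and AddHangfireServer (from the Hangfire.AspNetCore/Hangfire.NetCore package) is just one common way to start the server:

using System;
using Hangfire;
using Hangfire.PostgreSql;
using Microsoft.Extensions.Configuration;
using Microsoft.Extensions.DependencyInjection;
using Microsoft.Extensions.Hosting;

var host = Host.CreateDefaultBuilder(args)
    .ConfigureServices((context, services) =>
    {
        var connectionString = context.Configuration.GetConnectionString("HangfireConnection");

        services.AddHangfire(config => config
            .UseSimpleAssemblyNameTypeSerializer()
            .UseRecommendedSerializerSettings()
            .UsePostgreSqlStorage(connectionString, new PostgreSqlStorageOptions
            {
                DistributedLockTimeout = TimeSpan.FromMinutes(5),
                InvisibilityTimeout = TimeSpan.FromMinutes(20)
            }));

        // Registers the processing server as an IHostedService; without this
        // (or an explicit new BackgroundJobServer(...)), enqueued jobs are
        // never picked up.
        services.AddHangfireServer(options => options.WorkerCount = 1);
    })
    .Build();

await host.RunAsync();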

@azygis (Collaborator) commented Oct 22, 2024

Your dashboard configuration is using the SQL Server integration. Are you sure you copied your own configuration and not the one from the Hangfire docs?

@yadurajshakti (Author)

Hi @azygis
Here is the dashboard configuration from my application. We are using the UsePostgreSqlStorage(..) method only.

[screenshot]
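
For readers who cannot see the screenshot, the PostgreSQL counterpart of the earlier ConfigureServices snippet would look roughly like this; this is a sketch based on the description above, not a copy of the reporter's code:

public void ConfigureServices(IServiceCollection services)
{
    // Same registration as before, but with the Hangfire.PostgreSql
    // integration instead of UseSqlServerStorage.
    services.AddHangfire(configuration => configuration
        .SetDataCompatibilityLevel(CompatibilityLevel.Version_170)
        .UseSimpleAssemblyNameTypeSerializer()
        .UseRecommendedSerializerSettings()
        .UsePostgreSqlStorage(Configuration.GetConnectionString("HangfireConnection")));
}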

@azygis (Collaborator) commented Oct 23, 2024

Can you check whether the lock table has any entries? I suspect a lock was placed there and the application exited in a "kill" fashion, which prevented it from clearing the locks. Nothing picking up a job for 17 days is usually related to zombie locks.
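
For anyone checking the same thing, a read-only sketch of that lock-table query from C# with Npgsql, assuming the default "hangfire" schema created by Hangfire.PostgreSql; a plain SELECT in psql works just as well:

using System;
using Npgsql;

// Placeholder connection string; use the application's Hangfire connection.
var connectionString = "Host=localhost;Database=app;Username=app;Password=secret";

await using var connection = new NpgsqlConnection(connectionString);
await connection.OpenAsync();

// Default schema is "hangfire"; adjust if a custom schema name is configured.
await using var command = new NpgsqlCommand("SELECT * FROM \"hangfire\".\"lock\"", connection);
await using var reader = await command.ExecuteReaderAsync();

while (await reader.ReadAsync())
{
    // Print every column so this works across schema versions.
    for (var i = 0; i < reader.FieldCount; i++)
    {
        Console.Write($"{reader.GetName(i)} = {reader.GetValue(i)}  ");
    }
    Console.WriteLine();
}

// Long-lived rows left here after a hard kill are the "zombie locks"
// described above.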

@yadurajshakti (Author)

Thanks @azygis
This is happening mostly in our UAT/QA environments, where the deployment frequency is higher. Can you please suggest a mechanism to handle such cases?

  • Should we manually stop Hangfire jobs/engine/services during deployment?
  • Do you have any implementation to reset locks and other tables during deployments?

@azygis (Collaborator) commented Nov 4, 2024

First of all, do not kill the process. Instead of SIGKILL, send SIGTERM when you are stopping the application. I do not know what you use to stop it, hence that's the first suggestion. A proper shutdown sends a cancellation request to the jobs that are processing, as long as you use cancellation tokens. If the cancellation does not complete within some seconds (sorry, I can't remember the exact value; it's something like 5 or 10 seconds), the jobs get terminated. Locks are released right before the application stops. If the application is killed outright, there is no way for Hangfire to exit cleanly.
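
To illustrate the cancellation part: a job method can take a CancellationToken parameter, which Hangfire cancels on graceful shutdown (or when the job is deleted). The class and method names below are made up for the example:

using System;
using System.Threading;
using System.Threading.Tasks;
using Hangfire;

public class BatchJobs
{
    public async Task ProcessAsync(CancellationToken cancellationToken)
    {
        for (var i = 0; i < 100; i++)
        {
            // Stop cleanly as soon as shutdown (SIGTERM) is requested.
            cancellationToken.ThrowIfCancellationRequested();

            // Simulated unit of work.
            await Task.Delay(TimeSpan.FromSeconds(1), cancellationToken);
        }
    }
}

// Pass CancellationToken.None when enqueueing; Hangfire substitutes a real
// token at execution time:
// BackgroundJob.Enqueue<BatchJobs>(j => j.ProcessAsync(CancellationToken.None));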

As for stopping the engine, that is still not really possible without nasty workarounds. It is also not something the storage provider (which is what this repository/library is) can handle. We cannot clear the locks on startup or at any arbitrary time, because we might end up with a broken state.

What we did at work is add an endpoint to the applications that checks whether any jobs are enqueued or processing and, if so, wait until all of them are complete before continuing with the swarm deployment. Waiting is fine for us; YMMV.
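
A rough sketch of such a drain-check endpoint, using Hangfire's monitoring API; the route and response shape here are illustrative, not the actual implementation mentioned above:

using Hangfire;
using Hangfire.Storage;
using Hangfire.Storage.Monitoring;
using Microsoft.AspNetCore.Builder;
using Microsoft.AspNetCore.Http;

var builder = WebApplication.CreateBuilder(args);
// Hangfire storage is assumed to be configured here as earlier in the thread.
var app = builder.Build();

app.MapGet("/deployment/ready", () =>
{
    IMonitoringApi monitoring = JobStorage.Current.GetMonitoringApi();
    StatisticsDto stats = monitoring.GetStatistics();

    // Deployment waits until nothing is enqueued or processing.
    var busy = stats.Enqueued + stats.Processing;
    return Results.Json(new { ready = busy == 0, enqueued = stats.Enqueued, processing = stats.Processing });
});

app.Run();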
