Better support for resilience-oriented setups #2910

Open
atrauzzi opened this issue Oct 31, 2024 · 3 comments
Labels: enhancement (New feature or request), scale

Comments

@atrauzzi

atrauzzi commented Oct 31, 2024

Better support for resilience-oriented setups

Problem

Not all orchestrators or environments support the notion of dependent services to ensure that things like the database come up before FusionAuth starts.

How this is handled varies by community, and one school of thought that competes with specifying dependencies is to rely on restarts and/or retries.

Unfortunately, when FusionAuth fails to get a lock on a database in silent maintenance mode, it neither terminates nor makes any retry attempts.

This means that an environment that wishes to handle resilience through restarts or in-built retry mechanisms (or both!) has no way of guiding FusionAuth to a working state.

Solution

All or some combination of:

  • An explicit flag to instruct FusionAuth to retry its database connection for a certain period/interval
  • An explicit flag to instruct FusionAuth to terminate when it cannot connect to the database

Optionally, one of these could also be the default when running with FUSIONAUTH_APP_SILENT_MODE set to true.
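
As a rough illustration of the first option, here is a minimal Python sketch of retrying a connection for a bounded period at a fixed interval. It is not FusionAuth code: `retry_until`, `period_s` and `interval_s` are hypothetical names, and the injected `connect`, `clock` and `sleep` callables stand in for the real database driver and timers.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def retry_until(
    connect: Callable[[], T],
    period_s: float,
    interval_s: float,
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call connect() until it succeeds or period_s has elapsed.

    Hypothetical helper: once the period is over, the last error is
    re-raised so the caller (or the orchestrator) can decide what to do.
    """
    deadline = clock() + period_s
    while True:
        try:
            return connect()
        except Exception:
            # Give up when another full interval would overshoot the deadline.
            if clock() + interval_s > deadline:
                raise
            sleep(interval_s)
```

Re-raising after the period ends keeps the two proposed flags independent: the retry window is bounded, and whatever happens next (for example terminating the process) is a separate decision.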

Alternatives/workarounds

Unfortunately, there is no alternative or workaround. In some ways, the premise of resilience is itself a workaround-oriented approach.

Community guidelines

All issues filed in this repository must abide by the FusionAuth community guidelines.

How to vote

Please give us a thumbs up or thumbs down as a reaction to help us prioritize this feature. Feel free to comment if you have a particular need or comment on how this feature should work.

@mooreds
Collaborator

mooreds commented Nov 1, 2024

An explicit flag to instruct FusionAuth to retry its database connection for a certain period/interval

We currently support this via the DATABASE_CONNECTION_TIMEOUT setting, as documented here: https://fusionauth.io/docs/reference/configuration

An explicit flag to instruct FusionAuth to terminate when it cannot connect to the database

Do you mean a configuration parameter that, when set, causes FusionAuth to fail hard and refuse requests when it can't reach a database?

@atrauzzi
Author

atrauzzi commented Nov 1, 2024

Hmm, a "timeout" typically means how long to wait before a connection attempt is treated as failed. Either way, I don't think the discrepancy here is about whether something triggers timeout behaviour. For posterity, though, I tried setting FUSIONAUTH_DATABASE_CONNECTION_TIMEOUT to 30000 to see what happens:

Log output
2024-11-01T06:36:40.3868140 ---------------------------------------------------------------------------------------------------------
2024-11-01T06:36:40.3868460 ---------------------------------- Entering Silent Configuration Mode -----------------------------------
2024-11-01T06:36:40.3868690 ---------------------------------------------------------------------------------------------------------
2024-11-01T06:36:40.3868950 
2024-11-01T06:36:40.4545260 2024-11-01 11:36:40.451 AM ERROR com.inversoft.maintenance.db.DatabaseSilentModeWorkflowTask - Encountered an error while running silent mode
2024-11-01T06:36:40.4547090 java.lang.IllegalStateException: Unable to capture database lock. This indicates that the database either doesn't support locks or is misconfigured.
2024-11-01T06:36:40.4547660 	at com.inversoft.maintenance.db.JDBCMaintenanceModeDatabaseService.lockDatabase(JDBCMaintenanceModeDatabaseService.java:322)
2024-11-01T06:36:40.4548170 	at com.inversoft.maintenance.db.DatabaseSilentModeWorkflowTask.perform(DatabaseSilentModeWorkflowTask.java:43)
2024-11-01T06:36:40.4548530 	at com.inversoft.maintenance.DefaultMaintenanceModeWorkflow.performSilentConfiguration(DefaultMaintenanceModeWorkflow.java:47)
2024-11-01T06:36:40.4548880 	at com.inversoft.maintenance.BaseMaintenanceModePrimeMain.modules(BaseMaintenanceModePrimeMain.java:70)
2024-11-01T06:36:40.4549610 	at org.primeframework.mvc.BasePrimeMain.hup(BasePrimeMain.java:69)
2024-11-01T06:36:40.4550010 	at org.primeframework.mvc.BasePrimeMain.start(BasePrimeMain.java:100)
2024-11-01T06:36:40.4550370 	at io.fusionauth.app.FusionAuthMain.main(FusionAuthMain.java:27)
2024-11-01T06:36:40.4550760 Caused by: org.postgresql.util.PSQLException: The connection attempt failed.
2024-11-01T06:36:40.4551060 	at org.postgresql.core.v3.ConnectionFactoryImpl.openConnectionImpl(ConnectionFactoryImpl.java:358)
2024-11-01T06:36:40.4552080 	at org.postgresql.core.ConnectionFactory.openConnection(ConnectionFactory.java:54)
2024-11-01T06:36:40.4552480 	at org.postgresql.jdbc.PgConnection.<init>(PgConnection.java:273)
2024-11-01T06:36:40.4552860 	at org.postgresql.Driver.makeConnection(Driver.java:446)
2024-11-01T06:36:40.4553230 	at org.postgresql.Driver.connect(Driver.java:298)
2024-11-01T06:36:40.4553600 	at java.sql/java.sql.DriverManager.getConnection(DriverManager.java:683)
2024-11-01T06:36:40.4553970 	at java.sql/java.sql.DriverManager.getConnection(DriverManager.java:230)
2024-11-01T06:36:40.4554350 	at com.inversoft.maintenance.db.JDBCMaintenanceModeDatabaseService.lockDatabase(JDBCMaintenanceModeDatabaseService.java:304)
2024-11-01T06:36:40.4555020 	... 6 common frames omitted
2024-11-01T06:36:40.4555340 Caused by: java.io.EOFException: null
2024-11-01T06:36:40.4555610 	at org.postgresql.core.PGStream.receiveChar(PGStream.java:469)
2024-11-01T06:36:40.4555840 	at org.postgresql.core.v3.ConnectionFactoryImpl.enableSSL(ConnectionFactoryImpl.java:594)
2024-11-01T06:36:40.4556060 	at org.postgresql.core.v3.ConnectionFactoryImpl.tryConnect(ConnectionFactoryImpl.java:195)
2024-11-01T06:36:40.4556310 	at org.postgresql.core.v3.ConnectionFactoryImpl.openConnectionImpl(ConnectionFactoryImpl.java:262)
2024-11-01T06:36:40.4556550 	... 13 common frames omitted
2024-11-01T06:36:40.7556620 2024-11-01 11:36:40.755 AM INFO  io.fusionauth.api.configuration.DefaultFusionAuthConfiguration - Loading FusionAuth configuration file [/usr/local/fusionauth/config/fusionauth.properties]
2024-11-01T06:36:40.7567210 2024-11-01 11:36:40.756 AM INFO  io.fusionauth.api.configuration.DefaultFusionAuthConfiguration - Dynamically set property [fusionauth-app.url] set to [http://192.168.1.116:9011/]
2024-11-01T06:36:40.7568350 2024-11-01 11:36:40.756 AM INFO  com.inversoft.configuration.BasePropertiesFileInversoftConfiguration - 
2024-11-01T06:36:40.7568950   - Overriding default value of property [database.mysql.enforce-utf8mb4] with value [true]
2024-11-01T06:36:40.7569460   - Overriding default value of property [fusionauth-app.runtime-mode] with value [development]
2024-11-01T06:36:40.7569740   - Overriding default value of property [search.type] with value [database]
2024-11-01T06:36:40.7570100 
2024-11-01T06:36:40.8865620 2024-11-01 11:36:40.886 AM INFO  com.inversoft.maintenance.MaintenanceModePoller - Poller started to Wait for configuration to be completed.
2024-11-01T06:36:40.8887740 2024-11-01 11:36:40.888 AM INFO  io.fusionauth.app.primeframework.FusionHTTPContextAuthSetup - Initializing the FusionAuth HTTP Context.
2024-11-01T06:36:40.8987550 2024-11-01 11:36:40.898 AM INFO  org.primeframework.mvc.PrimeMVCRequestHandler - Initializing Prime
2024-11-01T06:36:40.9000280 2024-11-01 11:36:40.899 AM INFO  org.primeframework.mvc.PrimeMVCRequestHandler - Initializing Prime
2024-11-01T06:36:40.9008700 2024-11-01 11:36:40.900 AM INFO  io.fusionauth.http.server.HTTPServer - Starting the HTTP server. Buckle up!
2024-11-01T06:36:40.9087310 2024-11-01 11:36:40.908 AM INFO  io.fusionauth.http.server.HTTPServer - HTTP server listening on port [9011]
2024-11-01T06:36:40.9093130 2024-11-01 11:36:40.908 AM INFO  io.fusionauth.http.server.HTTPServer - HTTP server started successfully
2024-11-01T06:36:40.9094320 2024-11-01 11:36:40.908 AM INFO  io.fusionauth.http.server.HTTPServer - Starting the HTTP server. Buckle up!
2024-11-01T06:36:40.9094930 2024-11-01 11:36:40.909 AM INFO  io.fusionauth.http.server.HTTPServer - HTTP server listening on port [9012]
2024-11-01T06:36:40.9095420 2024-11-01 11:36:40.909 AM INFO  io.fusionauth.http.server.HTTPServer - HTTP server started successfully

The server goes into silent configuration mode and never recovers, and then never runs my kickstart.


Now, if I do the following temporary workaround:

  • Entrypoint: bash
  • Arguments: -c sleep 30 && /usr/local/fusionauth/fusionauth-app/bin/start.sh

After thirty seconds, FusionAuth starts and is able to connect to the database. So I know I have the potential for a working FusionAuth configuration and that there's nothing wrong with my setup. It's merely a matter of convincing FusionAuth to actually retry.
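
A fixed `sleep 30` could be swapped for a wrapper that polls until the database port accepts connections. The sketch below is one way to do that in Python, assuming Python is available in the image. `DB_HOST` and `DB_PORT` are hypothetical variable names used only here, and the start script path is the one from the workaround above. One caveat: the stack trace above fails inside `enableSSL` with an `EOFException`, which suggests the TCP connection itself was accepted before the database was ready, so a plain port check may not be sufficient in every environment.

```python
import os
import socket
import time


def wait_for_port(host: str, port: int, timeout_s: float, interval_s: float = 1.0) -> bool:
    """Return True once a TCP connection to host:port succeeds, False on timeout."""
    deadline = time.monotonic() + timeout_s
    while True:
        try:
            with socket.create_connection((host, port), timeout=interval_s):
                return True
        except OSError:
            if time.monotonic() >= deadline:
                return False
            time.sleep(interval_s)


def main() -> None:
    # DB_HOST and DB_PORT are hypothetical names for this sketch only.
    host = os.environ.get("DB_HOST", "db")
    port = int(os.environ.get("DB_PORT", "5432"))
    if not wait_for_port(host, port, timeout_s=60):
        # A non-zero exit lets the orchestrator restart the container.
        raise SystemExit(1)
    # Replace this process with the start script from the workaround.
    os.execv("/usr/local/fusionauth/fusionauth-app/bin/start.sh", ["start.sh"])
```

Calling `main()` as the container entrypoint would either hand over to the start script or exit with a failure code, which fits the restart-based approach described in the issue.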


So, coming back to what you mention above - and provided I have the config value name correct - I'm not sure the timeout is either working or necessarily the right solution.

The central question here may not even be around whether there is a timeout configured, but more around what happens after a timeout.

Here, FusionAuth could exhibit a number of behaviours, but ideally people could pick whichever is best for their environment. That matters especially given the first-time, out-of-the-box setup experience that FusionAuth offers, which in my scenario is actually not helpful because I'm using kickstart and API calls to complete configuration non-interactively.
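
To make "pick which one" concrete, here is a small Python sketch of a selectable failure policy. Everything in it is hypothetical: `OnDbFailure`, its three values and `handle_db_failure` are illustrative names, not existing FusionAuth options.

```python
import enum
import sys
from typing import Callable


class OnDbFailure(enum.Enum):
    """Hypothetical choices for a failed startup connection."""

    EXIT = "exit"                # terminate so the orchestrator restarts us
    RETRY = "retry"              # keep trying within the same process
    MAINTENANCE = "maintenance"  # fall back to the interactive setup screens


def handle_db_failure(
    policy: OnDbFailure,
    retry: Callable[[], None],
    enter_maintenance: Callable[[], None],
    exit_fn: Callable[[int], None] = sys.exit,
) -> None:
    """Dispatch to the behaviour selected by the operator."""
    if policy is OnDbFailure.EXIT:
        exit_fn(1)
    elif policy is OnDbFailure.RETRY:
        retry()
    else:
        enter_maintenance()
```

A non-interactive kickstart deployment would presumably choose `EXIT` or `RETRY`, while a first-time manual install could keep the interactive fallback.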

@mooreds
Collaborator

mooreds commented Nov 4, 2024

Thanks @atrauzzi.

I did some digging, and it looks like the FUSIONAUTH_DATABASE_CONNECTION_TIMEOUT variable doesn't apply at startup, when we're in maintenance mode and trying to find a database to connect to. It only applies to connections after startup, when the database connection is managed by our connection pool.

We do try to reconnect multiple times when starting up but I'll take a closer look.

mooreds added the enhancement label and removed the needs more info and feature labels on Nov 4, 2024
2 participants